Active Learning, Data Selection, Data Auto-Labeling, and Simulation in Autonomous Driving — Part 4
It’s time for Tesla! This should be an interesting one!
Tesla
In this talk at Tesla Autonomy Day in 2019, Andrej Karpathy explains the active learning procedure at Tesla, which they call the Data Engine. For example, in an object detection task, when a bike is attached to the back of a car, the neural network should detect just one object (the car) for downstream tasks such as decision-making and planning. Check the following image:
To fix this problem, they find a few images that show this pattern and use a machine-learning mechanism to search their fleet for similar examples. The images returned from the fleet can look like this:
Human annotators then label these examples as single cars, and the neural network is trained on them. So, in the future, the object detector will understand that the bike is just attached to the car and treat the pair as a single object. They run this loop all the time for all the rare cases, so their model becomes more and more accurate over time.
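Tesla hasn’t published how this search works internally, but a common way to implement it is similarity search over image embeddings. Here is a minimal sketch under that assumption; the embedding dimensions and random vectors are placeholders for embeddings produced by a real trained network.

```python
import numpy as np

def top_k_similar(query_embeddings, fleet_embeddings, k=100):
    """Return indices of the k fleet images most similar to the seed images.

    Similarity is the maximum cosine similarity between a fleet image
    and any of the hand-picked seed (query) images.
    """
    # Normalize so that dot products become cosine similarities.
    q = query_embeddings / np.linalg.norm(query_embeddings, axis=1, keepdims=True)
    f = fleet_embeddings / np.linalg.norm(fleet_embeddings, axis=1, keepdims=True)
    sims = f @ q.T                  # shape: (num_fleet, num_seed)
    best = sims.max(axis=1)         # best match against any seed image
    return np.argsort(-best)[:k]

# Toy usage with random vectors standing in for real image embeddings.
rng = np.random.default_rng(0)
seed_images = rng.normal(size=(5, 256))        # a few hand-picked "bike on car" images
fleet_images = rng.normal(size=(10_000, 256))  # embeddings of images from the fleet
candidates = top_k_similar(seed_images, fleet_images, k=100)
print(candidates[:10])  # these would be sent to human annotators
```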
After collecting some initial data, the models are trained. Then, whenever the model is uncertain, or there is a human intervention, or the model output (running in shadow mode) disagrees with the human driver’s behavior, that data is selected to be annotated by humans, and the model is trained on it. For example, if the lane line detection model does not work very well in tunnels, they will notice the problem in tunnels, use the mechanism explained above to find similar images, annotate those, and train the model on them.
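To make the selection criterion concrete, here is a minimal sketch of what such a trigger could look like. The frame fields, action names, and the confidence threshold are all hypothetical; the point is only the logic of flagging clips where the shadow-mode model disagrees with the driver or is uncertain.

```python
from dataclasses import dataclass

@dataclass
class Frame:
    model_action: str        # what the shadow-mode network would have done, e.g. "keep_lane"
    human_action: str        # what the driver actually did, e.g. "brake"
    model_confidence: float  # confidence of the network's prediction, in [0, 1]

def should_upload(clip, confidence_threshold=0.6):
    """Flag a clip for human annotation if the shadow-mode model disagrees
    with the driver or is uncertain anywhere in the clip."""
    for frame in clip:
        disagreement = frame.model_action != frame.human_action
        uncertain = frame.model_confidence < confidence_threshold
        if disagreement or uncertain:
            return True
    return False

clip = [
    Frame("keep_lane", "keep_lane", 0.95),
    Frame("keep_lane", "brake", 0.55),  # driver intervened while the model was unsure
]
print(should_upload(clip))  # True -> clip is sent back for labeling and retraining
```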
Andrej also talks about their automated data-labeling mechanism, called Fleet Learning, which complements expensive human annotation.
As an example, he talks about automatic cut-in detection. You are driving in a lane on a highway, and someone from the left or right lane cuts into your lane. Their fleet is capable of detecting this scenario.
This is not a hard-coded procedure with hand-written rules to identify the event; it is a learning process. They ask the fleet to find scenarios where a car from the left or right lane transitions into the ego lane. Then they go backward in time, automatically annotate that scenario, and train neural networks on it. The neural network picks up many of these patterns and learns from them. As Andrej says, the network may even learn internally about the left or right blinkers that signal a lane change, without any hard-coding. After the whole training procedure, they can deploy the model in shadow mode. In shadow mode, the network always makes predictions about these events and acts as a trigger to find mispredictions. The false positives and false negatives detected for free by the cut-in network are added to the training dataset and go through the Data Engine and training procedure explained above. After several rounds of this process for the cut-in network, once they are happy with its performance, the trained model can be turned on and take control of the car instead of running in shadow mode.
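Here is a small sketch of the retrospective auto-labeling idea: once a tracked car is observed entering the ego lane, go backward in time and mark the preceding frames as positive cut-in examples. The lane names and the lookback window are illustrative assumptions, not Tesla’s actual parameters.

```python
def auto_label_cut_ins(track, lookback=30):
    """Label frames of a single tracked vehicle as cut-in precursors.

    `track` is the vehicle's lane assignment over time ("left", "ego", "right").
    Once the vehicle is observed in the ego lane, we go backward in time and
    mark the preceding `lookback` frames as positives, so the network can
    learn to predict the cut-in *before* it happens.
    """
    labels = [0] * len(track)
    for t, lane in enumerate(track):
        if lane == "ego" and t > 0 and track[t - 1] != "ego":
            start = max(0, t - lookback)
            for i in range(start, t):
                labels[i] = 1  # frames leading up to the cut-in
    return labels

track = ["left"] * 40 + ["ego"] * 20  # a car drifts from the left lane into ours
labels = auto_label_cut_ins(track)
print(sum(labels))  # 30 automatically labeled positive frames, no human in the loop
```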
Here are some of the triggers they use:
At Tesla AI Day in 2021, they discuss the Autopilot software stack in more detail, including data collection and auto-labeling. Andrej talks about collecting clean and diverse data for training neural networks. Instead of image space, they work in vector space, a 3D representation of everything you need for driving: the 3D positions of lane lines, edges, curbs, traffic signs, traffic lights, and cars, together with their orientations, depths, velocities, and so on. The network takes raw image data and outputs the vector space for the scene. Part of the vector space is shown on the Tesla car screen in the following image:
Andrej also talks about labeling data in vector space instead of image space. The human annotator labels data in the 3D vector space, and the labels are automatically projected into image space, as shown in the GIF below:
But even this improved labeling procedure is not scalable. Active learning and auto-labeling can help here. Here is an example of how they label a clip. A clip contains data from different sensors, such as cameras, IMUs, and GPS, and lasts from 45 seconds to 1 minute.
Tesla engineering cars or customer cars can upload the clips to Tesla servers. Then many neural networks are run offline to produce intermediate results, such as segmentation masks, depth, and point matching. Finally, a lot of robotics and AI algorithms are used to create the final set of labels needed to train the neural networks.
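The overall pattern is a server-side pipeline: run several heavy offline networks on every frame of a clip, then fuse their outputs into labels. A toy sketch of that structure, with placeholder stand-ins for the real networks and fusion step:

```python
def offline_label_clip(clip_frames, networks, aggregate):
    """Run offline networks on every frame of an uploaded clip and fuse
    their outputs into final training labels.

    `networks` maps a name ("segmentation", "depth", ...) to a callable;
    `aggregate` fuses the per-frame intermediate results (this is where the
    robotics/optimization algorithms would live).
    """
    intermediates = []
    for frame in clip_frames:
        intermediates.append({name: net(frame) for name, net in networks.items()})
    return aggregate(intermediates)

# Toy stand-ins for the real offline networks and the fusion step.
networks = {
    "segmentation": lambda frame: f"mask({frame})",
    "depth": lambda frame: f"depth({frame})",
}
labels = offline_label_clip(["frame0", "frame1"], networks, aggregate=lambda xs: xs)
print(labels)
```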
Road surface labeling (segmentation) is one of the tasks they discuss. Typically, we can represent a road surface with splines or meshes, but due to their topology restrictions, they are not differentiable and not appropriate for deep learning. Also, segmenting each camera view separately in image space and then joining the views together to represent the scene does not work very well. Instead, they use an implicit representation in the style of Neural Radiance Fields (NeRF) to represent the road surface (I LOVED this idea). You can check this, this, and this to learn more about NeRF.
They can then query an (x, y) point on the ground, and the network predicts the height of the road surface and the semantic class at that point, such as lane line, curb, asphalt, or crosswalk. With the input (x, y) and the predicted height z, we have a 3D point that can be projected into all camera views. Making millions of these queries and projecting the resulting 3D points into all camera views yields a semantic segmentation, like the top-right picture in the image above. These projections can then be compared to the image-space semantic segmentation results, and a joint optimization over all camera views, across space and time, produces an excellent reconstruction:
Every car that drives through an area performs this labeling. By collecting this labeled data from one or many cars in the same area, they can bring it all together and align it using features such as road edges and lane lines, so that all trips agree with each other and with the image-space observations, and the whole scene can be generated. This is an effective way of labeling the road surface. After the labeling process, human annotators can check the results and clean up any noise in the data, or add metadata to make it even richer.
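Tesla hasn’t published the exact architecture, but here is a minimal PyTorch sketch of the implicit road-surface idea described above: a tiny MLP maps a ground-plane point (x, y) to a height z and a semantic class, and the resulting 3D points are projected into a camera view. The class list, network size, and camera matrix are all placeholders.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 4  # e.g. asphalt, lane line, curb, crosswalk (illustrative classes)

class ImplicitRoadSurface(nn.Module):
    """A tiny MLP mapping a ground-plane point (x, y) to the road height z
    and a distribution over semantic classes, in the spirit of a NeRF-style
    implicit representation."""
    def __init__(self, hidden=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.height_head = nn.Linear(hidden, 1)
        self.class_head = nn.Linear(hidden, NUM_CLASSES)

    def forward(self, xy):
        features = self.backbone(xy)
        return self.height_head(features), self.class_head(features)

def project(points_3d, camera_matrix):
    """Project 3D points (N, 3) into an image using a 3x4 projection matrix."""
    homogeneous = torch.cat([points_3d, torch.ones(len(points_3d), 1)], dim=1)
    uvw = homogeneous @ camera_matrix.T
    return uvw[:, :2] / uvw[:, 2:3]  # pixel coordinates (u, v)

model = ImplicitRoadSurface()
xy = torch.rand(1000, 2) * 50        # query points on the ground, in meters
z, class_logits = model(xy)
points_3d = torch.cat([xy, z], dim=1)
camera = torch.eye(3, 4)             # placeholder projection matrix
pixels = project(points_3d, camera)
# In training, the projected semantics would be compared with per-camera
# image-space segmentation, and the losses would be jointly optimized
# across all cameras, space, and time.
```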
They also talk about labeling more data, such as static objects, walls, and barriers. I won’t go into more detail here, but you get the idea.
In addition to the auto-labeling procedure explained above, Tesla also has a simulator and uses it for data generation and labeling. Here is their talk at Tesla AI Day 2021:
They mention that by using their vector space, they can produce the scenes they want very quickly! The simulator can be used in different cases:
To be this useful, the simulator needs accurate sensor simulation, photorealistic rendering, diverse actors and locations, scalable scenario generation, and real-world scenario reconstruction.
They also mention doing Reinforcement Learning with their simulator, which is my favorite topic and the subject of my Ph.D.
I think that’s enough for Tesla. Let’s go for another company.
Thank you for taking the time to read my post. If you found it helpful or enjoyable, please consider giving it a like and sharing it with your friends. Your support means the world to me and helps me to continue creating valuable content for you.