Active Learning, Data Selection, Data Auto-Labeling, and Simulation in Autonomous Driving — Part 2

Isaac Kargar
5 min read · Feb 14, 2024


Now, let’s look at what big companies do in the real world.

NVIDIA

NVIDIA proposes pool-based active learning with an acquisition function based on the disagreement between several trained models: the core of their system is an ensemble of object detectors that produce candidate bounding boxes and per-class probabilities, and the frames on which the ensemble disagrees most are treated as the most informative to label. Here are the steps in their proposed approach (a minimal code sketch of this loop follows the list):

  1. Train: Train N models initialized with different random parameters on all currently labeled training data.
  2. Query: Select examples from the unlabeled pool using the acquisition function.
  3. Annotate: Annotate selected data by a human annotator.
  4. Append: Append newly labeled data to training data.
  5. Go back to 1.
source
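To make the loop concrete, here is a minimal Python sketch of this ensemble-based loop. The callables train_model, acquisition_score, and human_annotate are hypothetical placeholders for the detector training routine, the scoring functions described next, and the human labeling step; this is an illustration, not NVIDIA's actual pipeline code.

```python
import random

def active_learning_loop(labeled, unlabeled, train_model, acquisition_score,
                         human_annotate, n_models=4, n_query=1000, n_rounds=10):
    """Pool-based active learning with an ensemble (illustrative sketch)."""
    for _ in range(n_rounds):
        # 1. Train: N models with different random initializations on all labeled data
        ensemble = [train_model(labeled, seed=random.randrange(1_000_000))
                    for _ in range(n_models)]

        # 2. Query: score every unlabeled frame and keep the most informative ones
        ranked = sorted(unlabeled, key=lambda x: acquisition_score(ensemble, x),
                        reverse=True)
        query = ranked[:n_query]

        # 3. Annotate: send the selected frames to human annotators
        newly_labeled = human_annotate(query)

        # 4. Append: add the newly labeled data to the training set, then repeat
        labeled = labeled + newly_labeled
        unlabeled = [x for x in unlabeled if x not in query]
    return labeled
```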

They assume the object detector generates a 2D probability map for each class (bicycle, person, car, etc.). Each position in this map relates to a pixel patch in the input image, and the probability indicates whether an object of that class has a bounding box centered there. This type of output map is commonly found in one-stage object detectors like SSD or YOLO. They use the following scoring functions:

  • Entropy: They compute the entropy of the Bernoulli random variable at each position in the probability map of a specific class:

H(p_c) = −p_c · log(p_c) − (1 − p_c) · log(1 − p_c)

where p_c denotes the probability at position p for class c.
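As a small illustration (my own sketch, not NVIDIA's code), this per-position Bernoulli entropy can be computed over a class probability map with NumPy:

```python
import numpy as np

def bernoulli_entropy(prob_map, eps=1e-8):
    """Per-position entropy of a Bernoulli random variable.

    prob_map: array of shape (H, W) giving the probability that an object of a
    given class has a bounding box centered at each position.
    Returns an (H, W) entropy map; values peak where the probability is near 0.5.
    """
    p = np.clip(prob_map, eps, 1.0 - eps)  # avoid log(0)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))
```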

  • Mutual Information (MI): This method uses an ensemble E of models to measure disagreement. First, the average probability over all ensemble members for each position p and class c is computed as:

p̄_c = (1 / |E|) · Σ_{e ∈ E} p_c^(e)

where |E| is the cardinality of E and p_c^(e) is the probability predicted by ensemble member e. Then, the mutual information is computed as:

MI(p_c) = H(p̄_c) − (1 / |E|) · Σ_{e ∈ E} H(p_c^(e))

MI encourages the selection of uncertain samples with high disagreement among ensemble models during the data acquisition process.
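A corresponding sketch for the mutual-information score, assuming the ensemble's probability maps are stacked into a single array (again my illustration, not NVIDIA's code):

```python
import numpy as np

def mutual_information(prob_maps, eps=1e-8):
    """Mutual information over an ensemble of probability maps.

    prob_maps: array of shape (|E|, H, W), one map per ensemble member.
    MI = H(mean prediction) - mean(H(individual predictions)); it is large only
    when members are individually confident but disagree with each other.
    """
    def entropy(p):
        p = np.clip(p, eps, 1.0 - eps)
        return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

    mean_p = prob_maps.mean(axis=0)                 # average over ensemble members
    return entropy(mean_p) - entropy(prob_maps).mean(axis=0)
```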

  • Gradient of the output layer (Grad): This function computes the model’s uncertainty from the magnitude of “hallucinated” gradients: the predicted label is assumed to be the ground-truth label, the gradient with respect to that label is computed, and its magnitude serves as a proxy for uncertainty. Samples the model is confident about produce small gradients. Each image is then assigned a single score based on the maximum or mean variance of its gradient vectors (see the sketch after this list).
  • Bounding boxes with confidence (Det-Ent): The predicted bounding boxes of the detector have an associated probability. This allows computing the uncertainty in the form of entropy for each bounding box.
source
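Here is the sketch referenced above for the “hallucinated” gradient score, written for a generic PyTorch model rather than NVIDIA's detector; model(image) returning class logits and model.output_layer naming the final layer are assumptions made for this illustration.

```python
import torch
import torch.nn.functional as F

def hallucinated_gradient_score(model, image, reduce="max"):
    """Uncertainty proxy from 'hallucinated' gradients (illustrative sketch).

    The model's own prediction is taken as the ground-truth label, the loss for
    that label is backpropagated, and the gradient magnitude in the output layer
    is used as the score. Confident samples yield small gradients.
    Assumes model(image) returns class logits and model.output_layer is the
    final layer; both are assumptions for this sketch.
    """
    model.zero_grad()
    logits = model(image.unsqueeze(0))              # (1, num_classes)
    pseudo_label = logits.argmax(dim=1)             # hallucinated ground truth
    loss = F.cross_entropy(logits, pseudo_label)
    loss.backward()

    grads = [p.grad.flatten() for p in model.output_layer.parameters()
             if p.grad is not None]
    g = torch.cat(grads)
    # score from the maximum absolute gradient or the mean squared gradient
    return g.abs().max().item() if reduce == "max" else g.pow(2).mean().item()
```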

To aggregate the per-position scores obtained with the techniques above into a single score per image, two approaches can be used: taking the maximum or taking the average. Taking the maximum is sensitive to outliers (the probability at a single position determines the final score), whereas taking the average favors images with a high number of objects (an image with a single high-uncertainty object might get a lower score than an image with many low-uncertainty objects). For the maximum, the image score is defined as:

s = max over all classes c and positions p of s_c(p)

where s_c(p) is the per-position score of class c computed with one of the functions above.
The top-N images with the highest scores can then be selected for annotation by human annotators. Alternatively, a feature vector can first be computed for each image (for example with a CNN), and techniques such as k-means++, Core-set, or sparse modeling (OMP), which combine diversity and uncertainty, can be used to select the samples (a short selection sketch follows below).
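As a closing sketch (my illustration, not the paper's exact procedure), the simplest selection takes the top-N scores, while a greedy Core-set-style variant trades off uncertainty against diversity in a feature space computed, for example, by a CNN backbone:

```python
import numpy as np

def select_top_n(scores, n):
    """Indices of the n highest-scoring images."""
    return np.argsort(scores)[::-1][:n]

def select_diverse(features, scores, n):
    """Greedy Core-set-style selection combining uncertainty and diversity.

    features: (num_images, d) feature vectors (e.g. from a CNN backbone).
    scores:   (num_images,) uncertainty scores from any scoring function above.
    Starts from the most uncertain image, then repeatedly adds the image whose
    feature vector is farthest from the already-selected set.
    """
    selected = [int(np.argmax(scores))]
    dist = np.linalg.norm(features - features[selected[0]], axis=1)
    for _ in range(n - 1):
        idx = int(np.argmax(dist))                  # farthest remaining image
        selected.append(idx)
        dist = np.minimum(dist, np.linalg.norm(features - features[idx], axis=1))
    return selected
```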

source

Here is a short video of the performance of their approach:

In addition to the active learning approach above, NVIDIA has developed another solution to deal with the challenges of data collection and labeling. They discussed this challenge at CES 2022, noting that the data we are most interested in labeling is also the most challenging to label: for example, dark, blurry, or hazy scenes.

There are also scenes that are difficult to comprehend, such as occluded vehicles or pedestrians, as well as scenes that are rarely seen, like construction zones. Due to these issues, autonomous vehicle developers need a mix of real and synthetic data. NVIDIA built the DRIVE Sim Replicator to deal with these labeling challenges. DRIVE Sim is a simulation tool built on Omniverse that takes advantage of the platform’s features. Engineers can generate hard-to-label ground-truth data with the Replicator and use it to train the deep neural networks that make up autonomous vehicle perception systems. With synthetic data, developers have more control over data generation, allowing them to tailor it to their specific needs and to produce data for scenarios even before any real-world data has been collected for them.

Since collecting and labeling real-world data is difficult, supplementing real-world data with synthetic data removes this bottleneck and lets engineers take a data-driven approach to developing autonomous driving systems. This improves real-world results and significantly speeds up AV development.

Another problem is the gap between the simulator world and the real world. The gap may be pixel-level or content-level. Omniverse Replicator is designed to narrow the appearance and content gaps. Read more about the capabilities of this amazing simulator here.

That’s it for NVIDIA. We will cover Waymo in the next blog post.

Thank you for taking the time to read my post. If you found it helpful or enjoyable, please consider giving it a like and sharing it with your friends. Your support means the world to me and helps me to continue creating valuable content for you.


Written by Isaac Kargar

Co-Founder and Chief AI Officer @ Resoniks | Ph.D. candidate at the Intelligent Robotics Group at Aalto University | https://kargarisaac.github.io/
