Behavior Prediction and Decision-Making in Self-Driving Cars Using Deep Learning — Part 2

Isaac Kargar
8 min read · Feb 14, 2024

--

Let’s continue from the point we left last time.


Behavior Prediction + Planner or Mid-to-Mid Driving

The next approach is to combine the Behavior Prediction and Planner modules.

source

ChauffeurNet

The next work is from Waymo. They tried to do prediction and planning together with a single neural network trained via Imitation Learning (IL).

They decided to use mid-level information from the Perception module and the HDMap to create BEV images as the input to their model. You can see the different inputs they used:

source

It is easy to augment this type of representation and create synthetic data for corner cases such as collisions, going off the road, and so on. You can see an example of a synthesized trajectory that teaches the car to return to the road when it drifts off:

source

Using this augmented data, their model is able to handle these cases. It also learns to avoid a parked car and nudge around it, as you can see in the following gif:

source

They collected 30 million real-world expert driving examples, corresponding to about 60 days of continuous driving. But even this amount of data is not enough to learn to drive using pure IL, so they used some techniques to improve the performance of their model:

  • Synthesize data easily → add perturbations to the expert’s driving → create collision and off-road cases (see the sketch after this list)
  • Augment the imitation loss with losses that discourage bad behavior and encourage progress
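
As a rough illustration of the perturbation idea (my own simplification, not Waymo’s actual pipeline), the sketch below laterally displaces the midpoint of an expert trajectory and re-fits a smooth path through it, producing a synthetic off-road state that the model is then trained to recover from:

```python
import numpy as np

def perturb_trajectory(waypoints, max_lateral_shift=1.0, seed=None):
    """Shift the midpoint of an expert trajectory sideways and re-fit a
    smooth quadratic through start, perturbed midpoint, and end.

    waypoints: (N, 2) array of (x, y) positions along the expert path.
    Returns a new (N, 2) trajectory that starts and ends on the original
    path but deviates in the middle (e.g. drifting off the road).
    """
    rng = np.random.default_rng(seed)
    waypoints = np.asarray(waypoints, dtype=float)
    n = len(waypoints)
    mid = n // 2

    # Displace the midpoint perpendicular to the local driving direction.
    direction = waypoints[mid + 1] - waypoints[mid - 1]
    normal = np.array([-direction[1], direction[0]])
    normal /= np.linalg.norm(normal) + 1e-8
    shift = rng.uniform(-max_lateral_shift, max_lateral_shift)
    perturbed_mid = waypoints[mid] + shift * normal

    # Fit smooth quadratics (one per coordinate) through the three anchor
    # points, parameterized by a normalized index, then resample N points.
    t = np.array([0.0, 0.5, 1.0])
    pts = np.stack([waypoints[0], perturbed_mid, waypoints[-1]])
    ts = np.linspace(0.0, 1.0, n)
    fit_x = np.polyfit(t, pts[:, 0], 2)
    fit_y = np.polyfit(t, pts[:, 1], 2)
    return np.stack([np.polyval(fit_x, ts), np.polyval(fit_y, ts)], axis=1)
```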

Their model architecture is as follows:

source

The Road Mask Net is responsible for predicting the road mask and forces the Feature Net to learn this concept. They also use a Perception RNN to predict the future motion of other agents, which again forces the Feature Net to learn that concept. Their model is a multi-task network that learns a better representation of the scene by solving multiple tasks. You can see the different loss terms they used in the figure.
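
Conceptually, the training objective is a weighted sum of the imitation loss, the environment losses (collision, off-road, etc.), and the auxiliary task losses. A trivial sketch of that combination; the loss names and weights here are placeholders, not the paper’s exact terms:

```python
def combined_loss(losses: dict, weights: dict) -> float:
    """Weighted sum of per-task losses, e.g.
    losses = {"imitation": ..., "collision": ..., "on_road": ...,
              "road_mask": ..., "perception": ...}."""
    return sum(weights.get(name, 1.0) * value for name, value in losses.items())
```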

Here are some other gifs of their model’s performance from here:

Perception + Behavior Prediction

The next approach is to perform some perception tasks and behavior prediction together using a single neural network.

Fast and Furious

source

This work is from Uber ATG and the University of Toronto. They perform detection, tracking, and short-term motion forecasting of objects’ trajectories from raw point cloud data. The network does all three tasks simultaneously in as little as 30 ms and predicts trajectories up to 1 s into the future. By learning these three tasks jointly, a form of multi-task learning, each task can use knowledge from the other tasks to improve its own performance.

The input for this model is a BEV representation created from point cloud data as follows (a minimal sketch follows the list):

  • Quantize the 3D world to form a 3D voxel grid
  • Assign a binary indicator for each voxel encoding whether the voxel is occupied
  • Consider height as the third dimension like a channel in RGB images
  • Consider time as 4th dimension
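
Here is a minimal NumPy sketch of that voxelization; the ranges, resolution, and number of height bins are placeholder values, not the paper’s exact settings:

```python
import numpy as np

def pointclouds_to_bev(sweeps, x_range=(0.0, 70.4), y_range=(-40.0, 40.0),
                       z_range=(-2.0, 3.0), voxel=0.2, z_bins=25):
    """Convert a sequence of LiDAR sweeps into a 4D binary occupancy tensor.

    sweeps: list of (M_i, 3) arrays of (x, y, z) points, one per timestamp.
    Returns a tensor of shape (T, H, W, Z): time x BEV grid x height bins,
    where height plays the role of the channel dimension of an RGB image
    and time is the 4th dimension.
    """
    T = len(sweeps)
    H = int((x_range[1] - x_range[0]) / voxel)
    W = int((y_range[1] - y_range[0]) / voxel)
    z_step = (z_range[1] - z_range[0]) / z_bins
    grid = np.zeros((T, H, W, z_bins), dtype=np.uint8)

    for t, pts in enumerate(sweeps):
        # Keep only points inside the region of interest.
        m = ((pts[:, 0] >= x_range[0]) & (pts[:, 0] < x_range[1]) &
             (pts[:, 1] >= y_range[0]) & (pts[:, 1] < y_range[1]) &
             (pts[:, 2] >= z_range[0]) & (pts[:, 2] < z_range[1]))
        p = pts[m]
        xi = ((p[:, 0] - x_range[0]) / voxel).astype(int)
        yi = ((p[:, 1] - y_range[0]) / voxel).astype(int)
        zi = ((p[:, 2] - z_range[0]) / z_step).astype(int)
        grid[t, xi, yi, zi] = 1  # binary indicator: voxel is occupied
    return grid
```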

The model is a single-stage detector that takes the created 4D input tensor and regresses directly to object bounding boxes at different timestamps without using region proposals.

They propose two fusion versions to exploit the temporal dimension:

1- Early fusion

  • Aggregates temporal information at the very first layer.
  • Fast as using the single frame detector.
  • Lacks the ability to capture complex temporal features as this is equivalent to producing a single point cloud from all frames, but weighting the contribution of the different timestamps differently.
  • Uses a 1D convolution with kernel size n on the temporal dimension to reduce it from n to 1 (see the sketch below)
source
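
A minimal PyTorch sketch of the early-fusion idea: the n input frames are collapsed into one at the very first layer by a learned 1D convolution over time. The per-channel grouping and sizes are my assumptions, not the paper’s exact layer:

```python
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Collapse n temporal frames into one feature map at the first layer."""
    def __init__(self, n_frames=5, height_channels=25):
        super().__init__()
        self.n_frames = n_frames
        # 1D convolution over the temporal axis with kernel size n, reducing
        # time from n to 1: effectively a learned weighted sum of the frames.
        # groups=height_channels gives each height channel its own weights
        # (one plausible reading; the exact grouping is an assumption).
        self.temporal = nn.Conv1d(height_channels, height_channels,
                                  kernel_size=n_frames, groups=height_channels)

    def forward(self, x):
        # x: (B, T, Z, H, W) -- batch, time, height channels, BEV grid
        B, T, Z, H, W = x.shape
        assert T == self.n_frames
        x = x.permute(0, 3, 4, 2, 1).reshape(B * H * W, Z, T)
        x = self.temporal(x)                                # (B*H*W, Z, 1)
        return x.reshape(B, H, W, Z).permute(0, 3, 1, 2)    # (B, Z, H, W)
```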

2- Late fusion

  • Gradually merges the temporal information, which allows the model to capture high-level motion features (see the sketch below).
source
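
And a corresponding sketch of late fusion, where the temporal dimension is reduced gradually over several layers instead of all at once; the use of two 3D convolutions and the channel sizes are placeholders:

```python
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    """Merge temporal information gradually instead of at the first layer."""
    def __init__(self, height_channels=25, hidden=64):
        super().__init__()
        # 3D convolutions without temporal padding: each layer shrinks the
        # time dimension by 2, so deeper layers see increasingly fused motion.
        self.conv1 = nn.Conv3d(height_channels, hidden, kernel_size=(3, 3, 3),
                               padding=(0, 1, 1))
        self.conv2 = nn.Conv3d(hidden, hidden, kernel_size=(3, 3, 3),
                               padding=(0, 1, 1))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # x: (B, Z, T, H, W) with T = 5 input frames
        x = self.relu(self.conv1(x))  # T: 5 -> 3
        x = self.relu(self.conv2(x))  # T: 3 -> 1
        return x.squeeze(2)           # (B, hidden, H, W)
```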

Similar to SSD, they use multiple predefined boxes for each feature map location. There are two branches on top of the computed feature map (sketched after the list):

  • One is for binary classification to predict the probability of being a vehicle for each pre-allocated box.
  • One to predict (regress) the bounding box over the current frame as well as n − 1 frames into the future → size and heading
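
A rough sketch of those two heads over the fused BEV feature map; the number of anchors, number of predicted frames, and box parameterization are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DetectionHeads(nn.Module):
    """Classification + regression heads over the fused BEV feature map."""
    def __init__(self, in_channels=64, n_anchors=6, n_frames=10, box_dim=6):
        super().__init__()
        self.n_anchors, self.n_frames, self.box_dim = n_anchors, n_frames, box_dim
        # Branch 1: probability of "vehicle" for each predefined box.
        self.cls_head = nn.Conv2d(in_channels, n_anchors, kernel_size=1)
        # Branch 2: box parameters (center offset, size, heading, ...) for the
        # current frame plus n - 1 future frames, per predefined box.
        self.reg_head = nn.Conv2d(in_channels, n_anchors * n_frames * box_dim,
                                  kernel_size=1)

    def forward(self, feat):
        # feat: (B, C, H, W) fused BEV feature map
        scores = torch.sigmoid(self.cls_head(feat))        # (B, A, H, W)
        boxes = self.reg_head(feat)                        # (B, A*F*D, H, W)
        B, _, H, W = boxes.shape
        boxes = boxes.view(B, self.n_anchors, self.n_frames, self.box_dim, H, W)
        return scores, boxes
```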

Here are some examples of their results:

source

IntentNet: Learning to predict intention from raw sensor data

The next work is again from Uber ATG and the University of Toronto, and it is actually an extension of the previous work. Here, in addition to the BEV representation generated from the point cloud, they use HDMap information and fuse the features extracted from both the point cloud and the HDMap to perform detection, intention prediction, and trajectory prediction.

The inputs and outputs of the network are as follows:

  • Inputs:

1- Voxelized LiDAR in BEV → height and time are stacked in the channel dimension → use 2D conv

2- Rasterized Map (both static and dynamic info) → 17 binary masks used as map features

source
  • Outputs:

1- Detected objects

2- Trajectory

3- High-level discrete intention: multi-class classification with 8 classes: keep lane, turn left, turn right, left change lane, right change lane, stopping/stopped, parked, other

The network architecture is as follows:

source

As you can see in the image, there are two branches that process the LiDAR BEV and the HDMap data, a fusion stage that combines their features, and finally three heads, one per task.
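
The structure can be sketched roughly as below; the channel sizes, block depths, and output parameterizations are my own placeholders rather than the paper’s values:

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out, stride=1):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=stride, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class IntentNetSketch(nn.Module):
    """Two-branch backbone (LiDAR BEV + rasterized map) with three heads."""
    def __init__(self, lidar_channels=32, map_channels=17,
                 n_anchors=6, n_waypoints=10, n_intentions=8):
        super().__init__()
        self.lidar_branch = nn.Sequential(conv_block(lidar_channels, 64),
                                          conv_block(64, 128))
        self.map_branch = nn.Sequential(conv_block(map_channels, 32),
                                        conv_block(32, 64))
        self.fusion = conv_block(128 + 64, 128)
        # Three task heads over the fused feature map:
        self.detection = nn.Conv2d(128, n_anchors * 7, 1)   # score + box params
        self.trajectory = nn.Conv2d(128, n_anchors * n_waypoints * 2, 1)
        self.intention = nn.Conv2d(128, n_anchors * n_intentions, 1)  # 8 classes

    def forward(self, lidar_bev, map_raster):
        # Both rasters share the same BEV grid, so spatial sizes match.
        f = torch.cat([self.lidar_branch(lidar_bev),
                       self.map_branch(map_raster)], dim=1)
        f = self.fusion(f)
        return self.detection(f), self.trajectory(f), self.intention(f)
```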

Here is an example of their results:

source

End-to-End

The final approach is to use one single neural network for all tasks: it takes raw sensor data as input and generates control commands. This is called End-to-End driving.

Learning to Drive in a Day

This work is from Wayve.ai, which is one of my favorite startups. Their mission is to solve the self-driving problem end-to-end. They have several interesting works showing promising results that an end-to-end approach can learn how to drive like a human.

In this work, they used RL to train a driving policy to follow a lane from scratch in less than 20 minutes, without any HDMap or hand-written rules! This is the first example where an autonomous car has learned online, getting better with every trial. They used the DDPG algorithm with a single monocular camera image as input to the network, generating control commands such as steering and speed.

source

The learning procedure runs entirely on one onboard GPU. The network is a small deep network with 4 convolutional layers and 3 fully connected layers, totaling just under 10k parameters, and the reward signal is the distance traveled by the vehicle before the safety driver takes control.
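
For intuition, here is a tiny PyTorch sketch in the spirit of that setup: a small convolutional encoder followed by fully connected layers that output steering and a speed set-point, plus the simple reward idea. The layer sizes are placeholders, and the full DDPG machinery (critic, replay buffer, exploration noise) is omitted:

```python
import torch
import torch.nn as nn

class LaneFollowingActor(nn.Module):
    """Tiny actor: 4 conv layers on a monocular image, then 3 FC layers."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=5, stride=4), nn.ReLU(),
            nn.Conv2d(8, 8, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(8, 8, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(8, 8, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.policy = nn.Sequential(
            nn.Linear(8, 32), nn.ReLU(),
            nn.Linear(32, 32), nn.ReLU(),
            nn.Linear(32, 2), nn.Tanh(),   # [steering, speed] in [-1, 1]
        )

    def forward(self, image):
        z = self.encoder(image).flatten(1)
        return self.policy(z)

def step_reward(speed_mps, dt, intervened):
    """Per-step proxy for the episode reward: distance traveled before the
    safety driver takes control (zero once an intervention happens)."""
    return 0.0 if intervened else speed_mps * dt
```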

Here is a video of the training procedure they published on their website:

source

Learning to Drive Like a Human

The next work is again from Wayve.ai. In this work, they combine Imitation Learning with Reinforcement Learning: they first copy the driving skills of an expert driver using IL and then use RL to fine-tune the policy, learning from the safety driver's interventions and correction signals.

Compared to the previous work, they used the sat-nav route command in addition to camera images as input, and the model outputs control commands:

They used some auxiliary tasks, such as segmentation, depth estimation, and optical flow estimation, to learn a better representation of the scene and used it to train the policy.
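
A minimal sketch of what such a command-conditioned policy with auxiliary heads might look like; the encoder, head shapes, and number of route commands are assumptions for illustration, not Wayve's actual architecture:

```python
import torch
import torch.nn as nn

class ConditionalDrivingPolicy(nn.Module):
    """Command-conditioned policy with auxiliary heads. The shared encoder is
    trained with both the control (imitation/RL) loss and auxiliary losses
    (segmentation, depth, optical flow) to learn a richer scene representation."""
    def __init__(self, n_commands=4, n_classes=10):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Auxiliary decoders that supervise the shared representation.
        self.seg_head = nn.Conv2d(64, n_classes, 1)   # semantic segmentation
        self.depth_head = nn.Conv2d(64, 1, 1)         # monocular depth
        self.flow_head = nn.Conv2d(64, 2, 1)          # optical flow
        # Control head conditioned on the sat-nav route command.
        self.control = nn.Sequential(
            nn.Linear(64 + n_commands, 128), nn.ReLU(),
            nn.Linear(128, 2), nn.Tanh(),              # steering, speed
        )

    def forward(self, image, command_onehot):
        f = self.encoder(image)                        # (B, 64, h, w)
        pooled = f.mean(dim=(2, 3))                    # global average pool
        control = self.control(torch.cat([pooled, command_onehot], dim=1))
        return control, self.seg_head(f), self.depth_head(f), self.flow_head(f)
```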

Here are two examples of the performance of their model, which is able to handle two complex scenarios: an intersection and a narrow road:

And finally, this is a brief explanation of their work:

Summary

We reviewed several interesting works that use deep learning to solve the self-driving car problem. We grouped the approaches by which modules use ML and DL, from a single module to several combined modules, and finally full-stack, end-to-end approaches. I hope we see more interesting works in the future. Let's see.

Thank you for taking the time to read my post. If you found it helpful or enjoyable, please consider giving it a like and sharing it with your friends. Your support means the world to me and helps me to continue creating valuable content for you.

References

  • Cui, H., Radosavljevic, V., Chou, F. C., Lin, T. H., Nguyen, T., Huang, T. K., … & Djuric, N. (2019, May). Multimodal trajectory predictions for autonomous driving using deep convolutional networks. In 2019 International Conference on Robotics and Automation (ICRA) (pp. 2090–2096). IEEE.
  • Djuric, N., Radosavljevic, V., Cui, H., Nguyen, T., Chou, F. C., Lin, T. H., … & Schneider, J. (2020). Uncertainty-aware short-term motion prediction of traffic actors for autonomous driving. In The IEEE Winter Conference on Applications of Computer Vision (pp. 2095–2104).
  • Xia, T., & Han, Z. (2018). Path Planning using Reinforcement Learning and Objective Data (Master’s thesis).
  • http://oas.voyage.auto/scenarios/intersections.html
  • Bansal, M., Krizhevsky, A., & Ogale, A. (2018). Chauffeurnet: Learning to drive by imitating the best and synthesizing the worst. arXiv preprint arXiv:1812.03079.
  • Luo, W., Yang, B., & Urtasun, R. (2018). Fast and furious: Real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (pp. 3569–3577).
  • Casas, S., Luo, W., & Urtasun, R. (2018, October). Intentnet: Learning to predict intention from raw sensor data. In Conference on Robot Learning (pp. 947–956).
  • Kendall, A., Hawke, J., Janz, D., Mazur, P., Reda, D., Allen, J. M., … & Shah, A. (2019, May). Learning to drive in a day. In 2019 International Conference on Robotics and Automation (ICRA) (pp. 8248–8254). IEEE.
  • https://wayve.ai/blog/driving-like-human#:~:text=It%20learns%20end%2Dto%2Dend,to%20improve%20our%20driving%20policy.

--

Isaac Kargar

Co-Founder and CIO @ Resoniks | Ph.D. candidate at the Intelligent Robotics Group at Aalto University | https://kargarisaac.github.io/