LLMs in Autonomous Driving — Part 3

Isaac Kargar
7 min read · Feb 18, 2024


Note: AI tools are used as assistants in this post!

In this part, we will review the DriveGPT4 paper. Let’s get started!

DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model

Self-driving cars have moved from science fiction to fast-approaching reality. While much of the focus is on making these vehicles safe and reliable, what about making them understandable and interpretable? A new system called DriveGPT4 takes a big step toward helping us understand how autonomous vehicles make their decisions.

What is DriveGPT4?

DriveGPT4 is an autonomous driving system built on large language models (LLMs). While most chatbots specialize in conversation, DriveGPT4 has also been trained to process videos. This means it can “see” what the car’s cameras see and interpret that visual information through a language-based model. In short, it is a multimodal LLM (MLLM).

The core idea behind DriveGPT4 is improving the interpretability of self-driving systems. Put simply, the goal is an autonomous vehicle that is not only capable of driving itself but also able to explain why it makes the choices it does.

How DriveGPT4 Works

  • Seeing: DriveGPT4 takes video footage from the car’s cameras, breaks it down, and then converts it into the same kind of text-based data that its language model understands.
  • Understanding: The system has been trained on a massive dataset of driving videos paired with explanations, similar to how someone might learn to drive by watching lots of training videos with expert commentary.
  • Decision-Making: DriveGPT4 not only generates instructions for controlling the car (speed, steering) but also produces natural language explanations for its actions. It can answer human questions like “Why did you slow down there?” or “What’s happening in the next lane?” A small illustrative sketch of this interaction follows below.
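
To make this concrete, here is a purely illustrative Python sketch of the kind of question, answer, and control record such a system handles. The field names and values are assumptions made for illustration, not the paper’s actual format.

# Purely illustrative example of the kind of input/output a DriveGPT4-style
# system handles (field names and values are assumptions, not the paper's format).
example_interaction = {
    "video": "front_camera_clip.mp4",          # what the car "sees"
    "question": "Why did you slow down there?",
    "answer": "A pedestrian is crossing ahead, so the vehicle brakes and yields.",
    "control": {"speed_mps": 3.2, "turning_angle_deg": -1.5},   # next-frame prediction
}

print(example_interaction["answer"])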

Making Driving Safer and More Trustworthy

The ability of DriveGPT4 to explain itself offers several benefits for self-driving cars:

  • Trust: If passengers can understand the ‘thinking’ behind the car’s decisions, they’re more likely to feel comfortable and trust the technology.
  • Debugging: In cases of potential accidents or incidents, having the car explain itself can help quickly identify problems in the system’s reasoning. This facilitates improvements and debugging.
  • Learning: This explanation ability may help human drivers learn from how a self-driving car processes situations, potentially making them better drivers as a result.

Dataset

The researchers behind DriveGPT4 created a new dataset specifically to train their system for “interpretable end-to-end autonomous driving.” This means the dataset is made to help the AI do two key things:

  1. Control the car: It teaches the model to predict accurate driving instructions (speed, turning angle) based on what the car “sees” through its camera.
  2. Explain its reasoning: It teaches the model to generate human-like explanations for the driving choices it’s making and understand a wide variety of questions a user might ask.

The dataset is built on an existing self-driving car dataset called BDD-X. Here’s what they used and added:

1. BDD-X Dataset

  • Purpose: The foundation of the dataset; provides real-world driving data essential for training an autonomous driving system.
  • Content:
  • — Videos: Footage captured from a car’s front-facing camera, simulating the car’s “vision”.
  • — Control Signals: Data for each video frame on human steering actions (angle) and speed. The AI learns to predict these.
  • — Text Annotations: Simple human-written descriptions and justifications for driving actions in each video (e.g., “Car is turning left”, “Yielding to oncoming traffic”).
  • How it’s created: Collected from real-world driving sessions, with human drivers providing both the video input and the control actions. A minimal sample sketch follows below.
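
To make the data structure concrete, here is a minimal sketch of what one BDD-X-style training sample might look like. The field names and values are illustrative assumptions, not the dataset’s exact schema.

# Illustrative structure of one BDD-X-style training sample
# (field names are assumptions, not the dataset's exact schema).
bddx_sample = {
    "video": "clip_0001.mp4",                                    # front-facing camera footage
    "control_signals": [                                         # per-frame human driving data
        {"frame": 0, "speed_mps": 8.4, "turning_angle_deg": 0.0},
        {"frame": 1, "speed_mps": 8.1, "turning_angle_deg": -0.5},
    ],
    "action_description": "The car slows down.",                 # what the driver does
    "action_justification": "Because the traffic light ahead turns red.",  # why
}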

2. BDD-X Question-Answer Pairs (QAs)

  • Purpose: Extends the BDD-X dataset to start training the AI’s language explanation ability.
  • Content: Pairs of questions and answers based directly on the text annotations in the BDD-X dataset. There are three types of QA pairs:
  • — Action description: “What is the current action of the vehicle?”
  • — Action justification: “Why does this vehicle behave in this way?”
  • — Control signal prediction: “Predict the speed and turning angle of the vehicle in the next frame.”
  • How it’s created:
  • — Questions are created as variations of the three basic forms stated above.
  • — Answers are taken directly from the corresponding BDD-X text annotations (a minimal sketch of this construction follows below).
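
Building on the sample sketched above, here is a minimal sketch of how the three fixed-form QA types could be assembled from a BDD-X annotation. The question templates and helper names are illustrative, not the authors’ exact templates.

import random

# Toy question templates for the three QA types (illustrative wording).
DESCRIPTION_QUESTIONS = [
    "What is the current action of the vehicle?",
    "What is the car doing right now?",
]
JUSTIFICATION_QUESTIONS = [
    "Why does this vehicle behave in this way?",
    "What is the reason for this action?",
]
CONTROL_QUESTION = "Predict the speed and turning angle of the vehicle in the next frame."

def build_qa_pairs(sample):
    """Turn one annotated clip (like `bddx_sample` above) into three (question, answer) pairs."""
    next_control = sample["control_signals"][-1]
    return [
        (random.choice(DESCRIPTION_QUESTIONS), sample["action_description"]),
        (random.choice(JUSTIFICATION_QUESTIONS), sample["action_justification"]),
        (CONTROL_QUESTION,
         f"Speed: {next_control['speed_mps']:.1f} m/s, "
         f"turning angle: {next_control['turning_angle_deg']:.1f} degrees"),
    ]

# build_qa_pairs(bddx_sample) -> three (question, answer) pairs for this clip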

3. Additional QAs Generated by ChatGPT

  • Purpose: Drastically increases the diversity and complexity of explanations and questions that the AI is exposed to, making it more “conversational” and better able to interpret open-ended questions.
  • Content: A much wider range of questions and discussion about the videos, including:
  • — Descriptions of traffic conditions and surroundings
  • — Hypothetical situations and questions requiring inference
  • — Requests for stories or creative explanations
  • How it’s created: Here’s a simplified pseudocode sketch of the process (create_prompt and the chatgpt/object_detector clients are placeholders):
def generate_chatgpt_qa(video, bddx_labels, object_detector):
    # Use YOLOv8 or a similar detector to find cars, signs, pedestrians, etc.
    detected_objects = object_detector.detect(video)
    # Combine the video, its BDD-X labels, and the detections into a single prompt
    chatgpt_prompt = create_prompt(video, bddx_labels, detected_objects)
    # Ask ChatGPT to generate diverse question-answer conversations about the clip
    qa_pairs = chatgpt.generate_conversations(chatgpt_prompt)
    return qa_pairs

Here is one example of how ChatGPT is used:

Architecture

DriveGPT4 is designed to process both videos and text, ultimately translating a video of driving into human-understandable text explanations and driving control signals (speed, steering). Here’s how it works in broad strokes:

  1. Video Tokenizer: Breaks down video input into manageable chunks and converts visual information into a text-like format the language model can understand.
  2. LLM (LLaMA 2): The core language model processes the video tokens along with any text input, making predictions about actions, explanations, and control signals.
  3. De-tokenizer: Translates the LLM’s output back into two things:
  • Human-readable text explaining the vehicle’s actions.
  • Control signals encoded in a predetermined format, directly usable by the self-driving system. A skeleton of this three-part pipeline is sketched below.
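
The skeleton below sketches how these three components could fit together. The class and method names are illustrative assumptions, not the authors’ code.

# Skeleton of the video tokenizer -> LLM -> de-tokenizer pipeline described above
# (class and method names are illustrative assumptions).
class DriveGPT4Pipeline:
    def __init__(self, video_tokenizer, llm, detokenizer):
        self.video_tokenizer = video_tokenizer   # frames -> text-like tokens
        self.llm = llm                           # LLaMA 2 backbone
        self.detokenizer = detokenizer           # output tokens -> text + control signals

    def step(self, frames, user_question):
        video_tokens = self.video_tokenizer(frames)
        output_tokens = self.llm(video_tokens, user_question)
        explanation, control = self.detokenizer(output_tokens)
        # e.g. ("Slowing for a red light", {"speed": 2.0, "turning_angle": 0.0})
        return explanation, control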

The Tokenizers

Tokenizers are like translators between different types of data the model uses:

  1. Video Tokenizer
  • Input: A video is split into individual frames.
  • CLIP Encoder: This is a specialized tool (Radford et al., 2021) that analyzes each image frame, extracting key visual features.
  • Projector: Transforms these visual features into text-like tokens, similar to how words make up a sentence. Think of this step as creating visual “words” to describe the video.

The video tokenizer doesn’t output arbitrary text: it produces tokens specifically chosen to fit the language model’s existing vocabulary. Here’s a simplified view of the process:

Visual “Vocabulary”: Imagine the language model already knows a vast vocabulary of text tokens (like individual words and word pieces). The video tokenizer’s job is to create matching visual “words.”

CLIP to the Rescue: CLIP is a powerful tool for connecting images and text. It works by:

  • Embedding: Extracting key features from images and text and converting them into numerical representations (a.k.a. embeddings).
  • Similarity Scores: Calculating how similar an image embedding is to different text embeddings in its database.

Selecting Tokens: For each video frame:

  • Pass the frame through CLIP to get an image embedding.
  • Compare similarity scores between the image embedding and the language model’s known text tokens.
  • Choose the text tokens with the highest similarity as the representation of that video frame (sketched below).
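
Here is a small sketch of that similarity-matching idea using the openly available CLIP checkpoint on Hugging Face. The toy vocabulary and file path are assumptions for illustration; in DriveGPT4 itself, a learned projector handles the mapping from visual features to tokens.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative only: embed one frame with CLIP and pick the most similar entries
# from a toy text vocabulary. The real model uses a learned projector, not this lookup.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

frame = Image.open("frame_000.jpg")   # one video frame (path is a placeholder)
vocab = ["car", "pedestrian", "traffic light", "intersection", "rain"]   # toy "vocabulary"

with torch.no_grad():
    image_emb = model.get_image_features(**processor(images=frame, return_tensors="pt"))
    text_emb = model.get_text_features(**processor(text=vocab, return_tensors="pt", padding=True))

# Cosine similarity between the frame and each vocabulary entry
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
similarity = (image_emb @ text_emb.T).squeeze(0)

top_tokens = [vocab[i] for i in similarity.topk(k=3).indices.tolist()]
print(top_tokens)   # the visual "words" chosen to represent this frame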

2. Text Tokenizer (LLaMA)

  • Input: Text inputs (like questions you ask the system) and text-based control signals.
  • Output: Breaks down this text into smaller units (‘tokens’) the LLM understands. It’s akin to how we understand sentences by individual words or even their components.

3. De-tokenizer (LLaMA)

  • Input: Tokenized predictions from the LLM.
  • Output: Reconstructs these into meaningful text for humans to read, plus control signals formatted for the car’s systems to use (see the sketch below).
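
As a rough illustration of the tokenize/de-tokenize round trip, here is a minimal sketch using a LLaMA-style tokenizer from Hugging Face. The checkpoint name is an assumption (the official LLaMA 2 weights are gated), and the control-signal text format shown is invented for illustration, not the paper’s exact format.

from transformers import AutoTokenizer

# Checkpoint name is an assumption; any LLaMA-style tokenizer behaves similarly.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Text tokenizer: turn a user question into integer tokens the LLM can consume.
question = "Why did the vehicle slow down?"
token_ids = tokenizer(question).input_ids
print(token_ids)

# De-tokenizer: map predicted tokens back to text, then parse the control signals
# out of their predetermined textual format (the format here is illustrative).
predicted_text = tokenizer.decode(token_ids, skip_special_tokens=True)
print(predicted_text)

example_output = "The car slows for a pedestrian. Speed: 3.2, Turning angle: -1.5"
explanation, controls = example_output.split(" Speed: ")
speed_str, angle_str = controls.split(", Turning angle: ")
print(explanation, float(speed_str), float(angle_str))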

Training Procedure

Stage 1: Pretraining — Learning Visual “Words”

  • Data:
  • CC3M: 593K image-text pairs
  • WebVid-2M: 703K video-text pairs
  • Focus: Aligning video representation with how the language model “thinks” about text. This is like creating a visual-to-text dictionary for the model.
  • Why Generic Data? Teaches the foundational patterns of how visuals and language relate, without focusing on driving yet.
  • What Gets Trained: Only the “projector”, the component that translates video features into text-like tokens.

Stage 2: Mix-Finetuning — Specializing in Driving and Explanations

  • Data
  • Driving Specific: 56K video-text pairs focused on driving actions with explanations (created with help from ChatGPT!). This teaches DriveGPT4 the vocabulary and rules specific to driving behavior.
  • General Visual Data: 223K instruction-following pairs from other similar research work. This keeps the model sharp on understanding diverse visual scenarios, not just roads.
  • Focus: Teaching DriveGPT4 to:
  • Predict control signals for self-driving
  • Generate human-like explanations for its actions
  • Answer a wide range of questions about driving scenes
  • What Gets Trained: Now both the LLM (the core language piece) and the projector are further refined to learn this specialized task. A minimal sketch of the two-stage freezing scheme follows below.
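
In PyTorch terms, the two stages mainly differ in which parameters are allowed to update. Here is a minimal sketch, assuming a model object with projector and llm submodules (the attribute names are illustrative, not the authors’ code).

import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    # Toggle gradient updates for every parameter in a submodule.
    for param in module.parameters():
        param.requires_grad = trainable

def configure_stage(model: nn.Module, stage: int) -> None:
    if stage == 1:
        # Stage 1 (pretraining): only the projector learns the visual-to-text mapping.
        set_trainable(model.llm, False)
        set_trainable(model.projector, True)
    else:
        # Stage 2 (mix-finetuning): both the projector and the LLM are updated.
        set_trainable(model.llm, True)
        set_trainable(model.projector, True)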

Why Use Different Data in Stages?

  1. Foundation: Starting with general data creates a strong base of video-to-text understanding before getting into the complexities of driving.
  2. Specialization: Finetuning with driving-specific data makes DriveGPT4 an expert in the language of roads, cars, and traffic.
  3. Robustness: Adding in general visual data prevents DriveGPT4 from being too narrow. It’ll still understand things like unusual objects or scenery that might occur in real-world driving.

Here is one example of DriveGPT4 in use:

And some more examples from the test dataset:

Please check the paper for more detailed explanations and experiments.

Thank you for taking the time to read my post. If you found it helpful or enjoyable, please consider giving it a clap and subscribing to my newsletter here.

Written by Isaac Kargar

Co-Founder and Chief AI Officer @ Resoniks | Ph.D. candidate at the Intelligent Robotics Group at Aalto University | https://kargarisaac.github.io/
