LLMs in Autonomous Driving — Part 4

Isaac Kargar
Mar 29, 2024 · 6 min read


Note: AI tools are used as assistants in this post!

Let’s continue with two important papers from Google:

PaLM-E: An Embodied Multimodal Language Model

The core concept behind PaLM-E is to inject real-world observations (such as images, state estimates, and other sensor data) directly into the representational space of a pre-trained large language model (LLM). These observations are encoded as sequences of vectors in the language embedding space, matching the dimensionality of the LLM’s language tokens. The resulting vectors (from the word token embedder or a visual encoder) are then interleaved with ordinary embedded text tokens to form the prefix for the LLM. This allows PaLM-E to process visual and language information seamlessly.

As a decoder-only LLM, PaLM-E autoregressively generates text responses based on an input prefix or prompt. Notably, PaLM-E builds upon the existing PaLM (Chowdhery et al., 2022) language model, extending it with these embodied capabilities. PaLM-E takes both text and sensor data as input, combining them into multi-modal sentences. For example: “Q: What happened between <img_1> and <img_2>?” (where <img_i> represents an image embedding). PaLM-E is then trained on these multi-modal sentences and can produce text output, either answering questions or generating decision sequences to be executed by a robot (with the help of a low-level policy or planner).
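To make the idea of multi-modal sentences a bit more concrete, here is a minimal, hypothetical sketch (not the actual PaLM-E code) of how image embeddings could be projected into the LLM’s token embedding space and interleaved with text token embeddings to form the prefix. The class and projection names are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class MultimodalPrefixBuilder(nn.Module):
    """Hypothetical sketch: project image features into the LLM's token
    embedding space and interleave them with embedded text chunks."""

    def __init__(self, image_feat_dim: int, llm_embed_dim: int):
        super().__init__()
        # Linear projection so image vectors match the LLM token dimension.
        self.project = nn.Linear(image_feat_dim, llm_embed_dim)

    def forward(self, text_embeds, image_feats):
        # text_embeds: list of (T_i, D) chunks of embedded text tokens
        # image_feats: list of (N_i, F) visual feature sequences, one per <img_i>
        pieces = []
        for i, chunk in enumerate(text_embeds):
            pieces.append(chunk)                              # text before <img_i>
            if i < len(image_feats):
                pieces.append(self.project(image_feats[i]))   # <img_i> slot
        # The concatenated sequence is the prefix fed to the decoder-only LLM.
        return torch.cat(pieces, dim=0)

# Example: two text chunks around one image placeholder, LLM dim 4096.
builder = MultimodalPrefixBuilder(image_feat_dim=1024, llm_embed_dim=4096)
prefix = builder([torch.randn(5, 4096), torch.randn(3, 4096)], [torch.randn(16, 1024)])
print(prefix.shape)  # (5 + 16 + 3, 4096)
```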

PaLM-E bridges its text generation capabilities with embodied actions in two ways. For tasks like question answering or scene description, its text output directly provides the solution. However, for planning and control, PaLM-E generates text that can be used by a lower-level policy or controller. Essentially, PaLM-E acts as a high-level planner, directing low-level policies within a control loop.
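As a rough illustration of the “LLM as high-level planner inside a control loop” idea, here is a hedged sketch. The `llm_plan`, `low_level_policy`, and `get_observation` functions are hypothetical placeholders, not part of PaLM-E itself.

```python
def control_loop(task_instruction, llm_plan, low_level_policy, get_observation, max_steps=50):
    """Hypothetical sketch of a high-level planner (the LLM) driving a
    low-level policy in closed loop, in the spirit of PaLM-E."""
    for step in range(max_steps):
        obs = get_observation()                      # images / state estimates
        # The LLM generates the next high-level step as text, conditioned on
        # the instruction and the latest observation.
        plan_text = llm_plan(task_instruction, obs)
        if plan_text.strip().lower() == "done":
            break
        # A low-level policy or controller turns the text step into motor commands.
        low_level_policy(plan_text, obs)
```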

RT-1: Robotics Transformer — for Real-World Control at Scale


RT-1 is a robotics model that shows great promise for real-world robotic control. It is built on a Transformer architecture, the same family of models used in natural language processing. RT-1 takes instructions in natural language and performs actions in the real world. The authors evaluated RT-1 on a variety of tasks, including picking up objects, opening drawers, and following multi-step instructions, and found that it outperformed prior models on many of them. RT-1 can also be improved by incorporating data from other sources, such as simulation or different robots. Overall, the paper shows that RT-1 is a promising model for real-world robotic control.

How Does RT-1 Work?

RT-1 is a multi-task model that tokenizes both robot inputs and output actions. In other words, it breaks the robot’s inputs (camera images and task instructions) into discrete tokens, and it generates output tokens that are decoded into motor commands.
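To give a feel for what “tokenizing” a continuous action could look like, here is a small illustrative helper that discretizes each action dimension into a fixed number of uniform bins (the RT-1 paper describes using 256 bins per dimension; the helper itself is just an assumption for illustration).

```python
import numpy as np

def discretize_action(action, low, high, num_bins=256):
    """Illustrative helper: map each continuous action dimension to an
    integer token in [0, num_bins - 1] by uniform binning."""
    action = np.asarray(action, dtype=np.float32)
    # Normalize each dimension to [0, 1] given its valid range, then bin it.
    normalized = (action - low) / (high - low)
    tokens = np.clip((normalized * num_bins).astype(int), 0, num_bins - 1)
    return tokens

# Example: a 3-DoF base action (x, y, yaw), each with range [-1, 1].
low, high = np.full(3, -1.0), np.full(3, 1.0)
print(discretize_action([0.0, 0.5, -1.0], low, high))  # [128 192   0]
```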

RT-1 functions by taking a short sequence of images and a natural language task description as input. It then generates a corresponding action for the robot to perform at each time step. This process is accomplished through several key architectural components (a simplified sketch of the full pipeline follows the list):

  • Visual Feature Extraction: First, RT-1 processes the images and text. It utilizes an ImageNet-pretrained convolutional neural network (EfficientNet) that has been conditioned on a pretrained instruction embedding using FiLM layers. This step extracts visual features directly relevant to the task. I will explain the FiLM layer later in the post.
  • Tokenization: The system employs a Token Learner module to compute a compact set of tokens from the extracted visual features.
  • Transformer Processing: A Transformer attends to these tokens, ultimately generating discretized action tokens.
  • Action Breakdown: Actions comprise seven dimensions for arm movement (x, y, z, roll, pitch, yaw, gripper opening), three dimensions for base movement (x, y, yaw), and a discrete dimension enabling switching between three modes: arm control, base control, and episode termination.
  • Closed-Loop Control: RT-1 operates in a closed-loop control fashion, issuing actions at 3Hz until the model generates a “terminate” action or a pre-set number of time steps is reached.
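Putting the components above together, here is a hedged, highly simplified sketch of the RT-1 forward pass: FiLM-conditioned image features → TokenLearner-style token reduction → Transformer → discretized action tokens. The module choices and sizes are my own stand-ins for illustration, not the actual implementation.

```python
import torch
import torch.nn as nn

class RT1Sketch(nn.Module):
    """Simplified stand-in for the RT-1 pipeline, for illustration only."""

    def __init__(self, instr_dim=512, feat_dim=512, num_tokens=8,
                 num_action_dims=11, num_bins=256):
        super().__init__()
        # Stand-in for the FiLM-conditioned EfficientNet image encoder.
        self.cnn = nn.Conv2d(3, feat_dim, kernel_size=16, stride=16)
        self.film = nn.Linear(instr_dim, 2 * feat_dim)       # produces gamma, beta
        # TokenLearner-style module: reduce HxW feature maps to a few tokens.
        self.token_scores = nn.Conv2d(feat_dim, num_tokens, kernel_size=1)
        # Transformer over the compact image tokens.
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        # One classification head per action dimension (arm 7 + base 3 + mode 1).
        self.action_head = nn.Linear(feat_dim, num_action_dims * num_bins)
        self.num_action_dims, self.num_bins = num_action_dims, num_bins

    def forward(self, images, instr_embed):
        feats = self.cnn(images)                              # (B, C, H, W)
        gamma, beta = self.film(instr_embed).chunk(2, dim=-1)
        feats = gamma[..., None, None] * feats + beta[..., None, None]   # FiLM
        scores = self.token_scores(feats).flatten(2).softmax(dim=-1)     # (B, K, HW)
        tokens = torch.einsum("bkn,bcn->bkc", scores, feats.flatten(2))  # (B, K, C)
        out = self.transformer(tokens).mean(dim=1)            # pool token outputs
        logits = self.action_head(out)                        # (B, dims * bins)
        return logits.view(-1, self.num_action_dims, self.num_bins)

model = RT1Sketch()
logits = model(torch.randn(1, 3, 224, 224), torch.randn(1, 512))
print(logits.shape)  # (1, 11, 256): one token distribution per action dimension
```

The final tensor holds one categorical distribution per action dimension, which matches the idea of emitting discretized action tokens rather than continuous values.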

As promised, let’s look at what the FiLM layer is. The FiLM layer (Feature-wise Linear Modulation) plays a crucial role in the RT-1 model by allowing it to condition image processing on the natural language task instructions. Here’s a breakdown of its functionality:

  1. Extracting Image Features: The RT-1 model first utilizes a pre-trained convolutional neural network (CNN) like EfficientNet. This CNN, trained on the vast ImageNet dataset, extracts general features from the input images.
  2. Encoding Task Instructions: The natural language task instructions are converted into a numerical representation using a pre-trained embedding technique. This embedding essentially condenses the meaning of the instruction into a format the model can understand.
  3. Dynamic Feature Modulation: The FiLM layer takes two inputs: (a) the extracted image features from the CNN, and (b) the pre-trained embedding of the task instruction.
  4. Generating Scaling and Shifting Parameters: Based on the instruction embedding, the FiLM layer generates two sets of parameters for each feature map extracted by the CNN:
    - Scaling Factor (gamma): This parameter controls the importance of each feature element. A higher gamma value emphasizes a particular feature, while a lower value reduces its influence.
    - Shifting Parameter (beta): This parameter shifts the activation of each feature element. A positive beta value shifts the feature towards larger values, while a negative value shifts it towards smaller values.
  5. Scaling and Shifting Features: Finally, the FiLM layer applies these scaling (gamma) and shifting (beta) parameters element-wise to the corresponding features extracted by the CNN. This process dynamically modulates the prominence and representation of different features within the image based on the specific task instructions. A minimal code sketch of this mechanism follows below.
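Putting the steps above into code, here is a minimal sketch of a FiLM layer (a simplified stand-in, not the RT-1 implementation): it maps the instruction embedding to per-channel gamma and beta and then applies FiLM(x) = gamma * x + beta to the CNN feature maps.

```python
import torch
import torch.nn as nn

class FiLMLayer(nn.Module):
    """Minimal FiLM layer: modulate CNN feature maps with per-channel
    scale (gamma) and shift (beta) computed from an instruction embedding."""

    def __init__(self, instr_dim: int, num_channels: int):
        super().__init__()
        self.to_gamma = nn.Linear(instr_dim, num_channels)
        self.to_beta = nn.Linear(instr_dim, num_channels)

    def forward(self, feats: torch.Tensor, instr_embed: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) image features; instr_embed: (B, instr_dim)
        gamma = self.to_gamma(instr_embed)[:, :, None, None]  # (B, C, 1, 1)
        beta = self.to_beta(instr_embed)[:, :, None, None]
        # Feature-wise linear modulation: scale then shift every channel.
        return gamma * feats + beta

film = FiLMLayer(instr_dim=512, num_channels=64)
out = film(torch.randn(2, 64, 28, 28), torch.randn(2, 512))
print(out.shape)  # (2, 64, 28, 28)
```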

Imagine FiLM as a smart filter that refines the image understanding capabilities of the CNN. It analyzes the task instructions and adjusts how the CNN interprets the image. By emphasizing relevant features and downplaying irrelevant ones, the FiLM layer ensures the model focuses on the crucial visual details necessary for successful action execution.

You can read more about the FiLM layer here.

RT-1 and Large Language Models

RT-1 is similar to large language models (LLMs) in several ways. Both learn from large amounts of data and generalize their knowledge to new tasks. However, there are also important differences between the two. LLMs are typically trained on text data, while RT-1 is trained on robot demonstration data. As a result, LLMs are better at understanding and generating language, while RT-1 is better at controlling robots.

Thank you for taking the time to read my post. If you found it helpful or enjoyable, please consider giving it a clap and subscribing to my newsletter here.
