Paper Review: Making Sense of Vision and Touch: Self-Supervised Learning of Multimodal Representations for Contact-Rich Tasks
Self-supervised learning has become an active research area in many fields, especially robotics. In this post, I will review a paper that uses this approach for a robotic manipulation task (peg insertion) and received the Best Paper Award at ICRA 2019.
Introduction
This work uses both vision and haptic feedback:
- Visual feedback: provides semantic and geometric object properties for accurate reaching or grasp pre-shaping. It's worth mentioning that many works try to extract such properties, such as shape, size, weight, texture, and friction, through tactile sensors and exploratory procedures.
- Haptic feedback: provides observations of the current contact conditions between the object and the environment, enabling accurate localization and control under occlusions.
They use self-supervision and neural networks to learn a compact multimodal representation of the sensory inputs. The network learns to predict optical flow, whether contact will be made in the next control cycle, and whether the visual and haptic data streams are temporally concurrent. Training is action-conditional to encourage the encoding of action-related information. The compact representation then serves as input to a policy for contact-rich manipulation tasks that is learned through deep reinforcement learning. In this way, they decouple the state representation and control modules.
Problem Statement
They model the manipulation task as a finite-horizon, discounted MDP in which the state is the low-dimensional representation learned from high-dimensional visual and haptic sensory data, and the action is a continuous 3D displacement in Cartesian space.
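As a quick reminder, the objective of such an MDP looks like this (generic textbook notation, not copied from the paper; here s_t is the learned multimodal representation and a_t the 3D end-effector displacement):

```latex
% Finite-horizon, discounted MDP and its policy-optimization objective.
% Generic notation; s_t is the learned multimodal representation and
% a_t \in \mathbb{R}^3 is the end-effector displacement.
\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, \gamma, T), \qquad
\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T-1} \gamma^{t}\, r(s_t, a_t)\right]
```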
Multimodal Representation Learning
They proposed a set of predictive tasks for learning visual and haptic representations for contact-rich manipulation tasks. In this way, supervision can be obtained automatically rather than through manual labeling.
For the first module (the representation learning module), which is trained end-to-end with self-supervision, they use three types of sensory data as input: an RGB image, force/torque readings over a 32 ms window, and the end-effector position and velocity. The module encodes and fuses these inputs with neural networks into a 128-d multimodal representation.
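A minimal sketch of what such a fusion encoder could look like is below. The layer sizes, kernel sizes, and branch structure are my own illustrative choices, not the paper's exact architecture; only the inputs and the 128-d output follow the description above.

```python
import torch
import torch.nn as nn

class MultimodalEncoder(nn.Module):
    """Fuses RGB, force/torque, and proprioception into a 128-d vector.

    Layer sizes are illustrative guesses; only the 128-d output matches the paper.
    """
    def __init__(self):
        super().__init__()
        # Image branch: a small CNN over the RGB frame.
        self.image_net = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 64),
        )
        # Force/torque branch: a 32-timestep x 6-axis window, flattened into an MLP.
        self.ft_net = nn.Sequential(nn.Flatten(), nn.Linear(32 * 6, 64), nn.ReLU())
        # Proprioception branch: end-effector position + velocity (6-d here).
        self.proprio_net = nn.Sequential(nn.Linear(6, 32), nn.ReLU())
        # Fusion into the 128-d multimodal representation.
        self.fusion = nn.Linear(64 + 64 + 32, 128)

    def forward(self, rgb, force_torque, proprio):
        z = torch.cat([
            self.image_net(rgb),
            self.ft_net(force_torque),
            self.proprio_net(proprio),
        ], dim=-1)
        return self.fusion(z)
```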
The network has nearly half a million learnable parameters and would normally need a large amount of labeled training data, so they designed procedures to generate labels automatically through self-supervision. Given the compact representation of the current sensory data and the next robot action (which encodes action-related information), the model has to predict the optical flow generated by that action and whether the end-effector will make contact with the environment in the next control cycle. The ground-truth annotations for optical flow are generated automatically from the arm's joint encoders together with the known robot kinematics and geometry, and the ground-truth annotations for the binary contact states are generated by applying simple heuristics to the force/torque readings.
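To make the contact-labeling idea concrete, here is a sketch of how such a heuristic could look. The threshold value and the "any reading in the window" rule are placeholders of mine, not the paper's actual heuristic.

```python
import numpy as np

def contact_labels(force_torque, force_threshold=1.0):
    """Binary contact labels from force/torque windows via a simple threshold.

    force_torque: array of shape (N, 32, 6) -- N windows of 32 readings (Fx..Tz).
    force_threshold: force magnitude (in N) above which we call it contact;
    the value is a placeholder, not the paper's actual heuristic.
    """
    # Norm of the force components (first 3 axes) at each timestep.
    force_mag = np.linalg.norm(force_torque[..., :3], axis=-1)  # (N, 32)
    # Label a window as "contact" if any reading in it exceeds the threshold.
    return (force_mag.max(axis=-1) > force_threshold).astype(np.int64)
```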
They also use a third representation learning objective that predicts whether two sensor streams are temporally aligned. The motivation is that the different sensory streams are concurrent, which leads to correlations and redundancy between them, e.g., seeing the peg, touching the box, and feeling the force at the same time. To generate training data, they sample a mix of time-aligned multimodal pairs and randomly shifted ones.
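A sketch of how such aligned/shifted pairs could be sampled is below. The 50/50 split, the shift range, and the data layout are illustrative assumptions of mine, not the paper's exact recipe.

```python
import random

def sample_alignment_pair(trajectory, max_shift=10):
    """Sample a (vision, haptics, label) triple for the temporal-alignment objective.

    trajectory: list of per-timestep dicts with 'rgb' and 'force_torque' entries
    (assumed to have more than one timestep).
    Returns the visual input, a force/torque window (aligned or randomly shifted),
    and a binary label (1 = temporally aligned). Illustrative sampling scheme,
    not the paper's exact procedure.
    """
    t = random.randrange(len(trajectory))
    aligned = random.random() < 0.5
    if aligned:
        s = t
    else:
        # Pick a different timestep within +/- max_shift of t.
        offsets = [o for o in range(-max_shift, max_shift + 1)
                   if o != 0 and 0 <= t + o < len(trajectory)]
        s = t + random.choice(offsets)
    return trajectory[t]['rgb'], trajectory[s]['force_torque'], int(aligned)
```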
Policy Learning and Controller
For policy learning and control, they use model-free reinforcement learning (TRPO), because modeling contact interactions and multi-contact planning results in a complex optimization problem. In addition, model-free RL eliminates the need for an accurate dynamics model.
The policy network is a 2-layer MLP that takes the 128-d multimodal representation as input and produces a 3D displacement of the robot end-effector. To train the policy efficiently, they freeze the representation model's parameters during policy learning. The controller then takes the end-effector displacement and outputs direct torque commands to the robot.
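A minimal sketch of such a policy head, assuming the hidden size, the tanh squashing, and the step-size clamp are my own choices rather than the paper's exact values:

```python
import torch
import torch.nn as nn

class DisplacementPolicy(nn.Module):
    """2-layer MLP: 128-d multimodal representation -> 3D end-effector displacement.

    Hidden size and the tanh/step-size scaling are illustrative, not the paper's values.
    """
    def __init__(self, hidden=128, max_step=0.01):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(128, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )
        self.max_step = max_step  # clamp the displacement to a small step (meters)

    def forward(self, z):
        # z comes from the frozen representation model; detach so no gradient
        # flows back into the encoder during policy learning.
        return self.max_step * torch.tanh(self.net(z.detach()))
```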
Conclusion
In this post, we saw that using multimodal sensory data, instead of vision alone, can be helpful for robotic manipulation tasks that require interacting with the world to learn about the environment and the object. This approach could also be applied to other robotic tasks such as grasping and in-hand manipulation.