
Adapting to Multiple Embodiments and Multiple Tasks: The Birth of the Most Powerful Open-Source Robot Learning System "Octo"

Wed, May 29 2024 08:27 AM EST

Synced

Editor: Panda

In the field of robot learning, a common approach is to collect a dataset tailored to a specific robot and task, and then use it to train a policy. However, starting from scratch for every task and collecting enough data each time is costly, and the resulting policies tend to generalize poorly.

In theory, experience collected from other robots and tasks offers a potential solution: exposing a model to a wide variety of robot control problems could improve its generalization and performance on downstream tasks. Yet even as universal models capable of handling a range of natural language and computer vision tasks have emerged, constructing a "universal robot model" remains a formidable challenge.

Training a unified control policy for robots is extremely difficult: it must cope with different robot embodiments, sensor configurations, action spaces, task specifications, environments, and computational budgets.

Toward this goal, a line of research on "robot foundation models" has emerged. These models directly map robot observations to actions and generalize to new domains or robots via zero-shot or few-shot transfer. They are often called "generalist robot policies" (GRPs), emphasizing their ability to perform low-level visuomotor control across a variety of tasks, environments, and robot systems.

For example, GNM (General Navigation Model) handles a variety of robot navigation scenarios, RoboCat can operate different robot embodiments on goal-conditioned tasks, and RT-X can control five different robot embodiments through language. While these models represent significant progress, they also share several limitations: their input observations are predefined and often limited (e.g., a single camera video stream); they are difficult to fine-tune effectively for new domains; and the largest versions of these models are not available for public use (which matters a great deal).

Recently, the Octo Model Team, composed of 18 researchers from the University of California, Berkeley, Stanford University, Carnegie Mellon University, and Google DeepMind, unveiled their groundbreaking research achievement: the Octo model. This project effectively overcomes the aforementioned limitations.

They designed a system that allows a generalist robot policy (GRP) to more easily handle the diverse interfaces of downstream robot applications.

At the core of this model is a Transformer architecture that maps arbitrary input tokens (created from observations and tasks) to output tokens (which are then decoded into actions), and that can be trained on diverse robot and task datasets. The policy can accommodate different camera configurations without additional training, control different robots, and be guided by language commands or goal images - all simply by changing the model's input tokens.
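To make this concrete, below is a minimal Python sketch of that token-in, token-out interface. The stub backbone, the shapes, and all names are illustrative placeholders for exposition, not Octo's actual implementation.

```python
# Minimal sketch of the token-in, token-out policy interface.
# The stub "backbone", shapes, and names are illustrative, not Octo's code.
import numpy as np

D = 256  # token embedding width (illustrative)

def transformer_backbone(tokens: np.ndarray) -> np.ndarray:
    """Stand-in for the Transformer: maps (num_tokens, D) -> (num_tokens, D)."""
    return tokens  # identity stub

def action_head(embedding: np.ndarray, action_dim: int = 7) -> np.ndarray:
    """Stand-in readout head projecting an embedding to an action vector."""
    W = np.zeros((embedding.shape[-1], action_dim))  # placeholder weights
    return embedding @ W

def policy_step(task_tokens, obs_tokens, readout_token):
    # Conditioning is switched purely by swapping the task tokens:
    # language-command tokens and goal-image tokens enter the same way.
    tokens = np.concatenate([task_tokens, obs_tokens, readout_token], axis=0)
    embeddings = transformer_backbone(tokens)
    return action_head(embeddings[-1])  # decode the readout embedding

# Example: one step conditioned on 4 "language" tokens and 8 observation tokens.
action = policy_step(np.zeros((4, D)), np.zeros((8, D)), np.zeros((1, D)))
```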

Most importantly, this model can also adapt to new robot configurations with different sensor inputs, action spaces, or robot morphologies by using appropriate adapters and fine-tuning with a small target domain dataset and minimal computational budget.

Furthermore, Octo has been pre-trained on the largest robot manipulation dataset to date: 800,000 robot demonstrations from the Open X-Embodiment dataset. Octo is not only the first GRP that can be effectively fine-tuned to new observation and action spaces, but also the first generalist robot policy to be fully open-sourced (training workflow, model checkpoints, and data). The team also emphasizes in the paper that the way Octo's components are combined is itself a unique innovation.

Octo Model

Let's take a look at how the open-source generalist robot policy Octo is built. Overall, Octo is designed to be a flexible, broadly applicable policy that a wide variety of downstream robot applications and research projects can use.

Architecture

At the core of Octo is the Transformer-based policy π. It consists of three key components: an input tokenizer, a Transformer backbone network, and a readout head.

As shown in Figure 2, the input tokenizers convert language commands, goals, and observation sequences into tokens; the Transformer backbone processes these tokens into embeddings; and the readout heads produce the required outputs, namely actions.

Task and Observation Tokenizers

To convert task definitions (such as language instructions and target images) and observations (such as camera video streams) into a common tokenized format, the team employed different tokenizers for different modalities:

For language inputs, the text is first tokenized and then passed through a pre-trained Transformer to produce a sequence of language embedding tokens. Specifically, the model used is t5-base (111M).

Image observations and goal images are processed through a shallow convolutional stack, then split into a sequence of flattened image patches.

Finally, they construct the input sequence for the Transformer by adding learnable positional embeddings to the task and observation tokens and arranging them in a specific order.
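The sketch below illustrates this tokenization pipeline. The patch size, embedding width, and zero-valued placeholder projections are assumptions for exposition; the real model uses the pre-trained t5-base encoder, a learned convolution stack, and learnable positional embeddings.

```python
# Illustrative tokenization pipeline in the spirit described above.
import numpy as np

D = 256  # shared token width (illustrative)

def tokenize_language(instruction: str) -> np.ndarray:
    """Stand-in for the frozen t5-base encoder: text -> (num_tokens, D)."""
    return np.zeros((len(instruction.split()), D))  # placeholder embeddings

def tokenize_image(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened patches and project to width D.
    (Octo first applies a shallow convolution stack; a reshape stands in here.)"""
    H, W, C = image.shape
    patches = image.reshape(H // patch, patch, W // patch, patch, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)
    return patches @ np.zeros((patch * patch * C, D))  # placeholder projection

def assemble_sequence(task_tokens, obs_tokens_per_step):
    """Order tokens (task first, then observations by time step) and add
    positional embeddings (zeros here; learnable in the real model)."""
    tokens = np.concatenate([task_tokens] + obs_tokens_per_step, axis=0)
    return tokens + np.zeros_like(tokens)  # placeholder positional embeddings

seq = assemble_sequence(
    tokenize_language("pick up the block"),
    [tokenize_image(np.zeros((256, 256, 3)))],
)
```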

Transformer Backbone and Readout Heads

Once the inputs are assembled into a unified token sequence, they are processed by the Transformer. This mirrors prior work that trains Transformer-based policies on sequences of observations and actions.

Octo's attention pattern is block-wise masked: observation tokens can attend only to task tokens and to observation tokens from the same or earlier time steps, i.e., causally. Tokens corresponding to non-existent inputs are masked out entirely (e.g., the language tokens for datasets that lack language instructions). This modular design makes it easy to add or remove observations or tasks during fine-tuning.
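A sketch of such a block-wise mask is shown below; the token-group sizes and the mechanism for masking absent inputs are illustrative assumptions, not the exact implementation.

```python
# Block-wise causal mask: True means "may attend".
import numpy as np

def block_causal_mask(num_task: int, tokens_per_step: int, num_steps: int,
                      missing_tokens: tuple = ()) -> np.ndarray:
    N = num_task + tokens_per_step * num_steps
    mask = np.zeros((N, N), dtype=bool)
    mask[:num_task, :num_task] = True  # task tokens attend among themselves
    for t in range(num_steps):
        rows = slice(num_task + t * tokens_per_step,
                     num_task + (t + 1) * tokens_per_step)
        mask[rows, :num_task] = True  # observations attend to task tokens
        # ... and to observation tokens at the same or earlier time steps.
        mask[rows, num_task:num_task + (t + 1) * tokens_per_step] = True
    for i in missing_tokens:  # fully mask tokens whose input is absent
        mask[i, :] = False
        mask[:, i] = False
    return mask

# Example: 4 task tokens, 8 observation tokens per step, 2 time steps.
m = block_causal_mask(4, 8, 2)
```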

In addition to these input token modules, the team inserted learnable readout tokens. A readout token attends to the observation and task tokens that precede it, but no observation or task token attends to it - so readout tokens can read and process the internal embeddings without influencing them. Their role is similar to the [CLS] token in BERT: each serves as a compact vector summary of the observation sequence up to that point. A lightweight "action head" applied to the readout-token embeddings then predicts a "chunk" of multiple consecutive actions.
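The following sketch extends the mask from above with readout tokens and adds a stand-in chunked action head; the chunk length, action dimension, and zero-valued weights are illustrative assumptions.

```python
# Readout tokens read other tokens one-way; a lightweight head then maps a
# readout embedding to a chunk of consecutive actions. Shapes illustrative.
import numpy as np

def add_readout_to_mask(mask: np.ndarray, num_readout: int) -> np.ndarray:
    N = mask.shape[0]
    out = np.zeros((N + num_readout, N + num_readout), dtype=bool)
    out[:N, :N] = mask
    out[N:, :N] = True  # readout tokens read the tokens before them
    # The readout columns stay False: no token attends to readout tokens,
    # so they cannot influence the other embeddings.
    return out

def action_chunk_head(readout_embedding: np.ndarray,
                      chunk_len: int = 4, action_dim: int = 7) -> np.ndarray:
    """Stand-in lightweight head predicting `chunk_len` consecutive actions."""
    W = np.zeros((readout_embedding.shape[-1], chunk_len * action_dim))
    return (readout_embedding @ W).reshape(chunk_len, action_dim)
```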

This design lets users flexibly add new task and observation inputs or action output heads to the model during downstream fine-tuning. When new tasks, observations, or loss functions are added downstream, the pre-trained Transformer weights can be retained wholesale; only new positional embeddings, a new lightweight encoder, or the parameters of new heads required by the changed specification need to be added. This differs from previous architectures, where adding or removing an image input or changing the task specification would require reinitializing or retraining large parts of the pre-trained model.

Flexibility is crucial for Octo to truly become a "jack-of-all-trades" model: as it is impossible to cover all possible robot sensors and action configurations during pre-training, adjusting Octo's inputs and outputs during fine-tuning can make it a versatile tool in the robotics community. Furthermore, previous model designs using standard Transformer backbones or a fusion of visual encoders with MLP output heads locked in the types and order of model inputs. In contrast, switching Octo's observations or tasks does not necessitate reinitializing most of the model.
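In code, the bookkeeping this design enables might look like the sketch below: the pre-trained backbone parameters are kept verbatim, and only the pieces a new robot setup requires are freshly initialized. All parameter names here are hypothetical.

```python
# Sketch of fine-tuning bookkeeping: retain the backbone, initialize only
# the new input encoder, positional embedding, and action head.
import numpy as np

def init_for_new_robot(pretrained: dict, new_obs_dim: int,
                       new_action_dim: int, d: int = 256) -> dict:
    params = dict(pretrained)  # backbone weights retained as-is
    params["encoder/wrist_cam"] = np.zeros((new_obs_dim, d))      # new input encoder
    params["pos_embed/wrist_cam"] = np.zeros((1, d))              # new positional embedding
    params["head/joint_actions"] = np.zeros((d, new_action_dim))  # new action head
    return params
```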

Training Data

The team utilized a mixed dataset containing 25 datasets from Open X-Embodiment. Figure 3 illustrates the composition of the dataset.

For more details on training objectives and training hardware configurations, please refer to the original paper.

Model Checkpoints and Code

Here's the highlight! The team not only published Octo's paper but also fully open-sourced all resources, including:

  • Pre-trained Octo checkpoints, including Octo-Small with 27 million parameters and Octo-Base with 93 million parameters.
  • Fine-tuning scripts for the Octo model, based on JAX.
  • The pre-training workflow for Octo on the Open X-Embodiment dataset, based on JAX.
  • Data loaders for the Open X-Embodiment dataset, compatible with JAX and PyTorch.
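By way of illustration, loading a released checkpoint might look like the sketch below. The import path, checkpoint identifier, observation keys, and method signatures are assumptions based on the public release; consult the repository's README for the authoritative API.

```python
# Hypothetical usage of the released checkpoints - verify names against
# the repository's README before use.
import jax
import numpy as np
from octo.model.octo_model import OctoModel  # assumed import path

model = OctoModel.load_pretrained("hf://rail-berkeley/octo-base")  # assumed id

# Dummy single-frame observation standing in for a real camera stream
# (assumed keys and shapes).
observations = {
    "image_primary": np.zeros((1, 1, 256, 256, 3), dtype=np.uint8),
    "timestep_pad_mask": np.ones((1, 1), dtype=bool),
}
task = model.create_tasks(texts=["pick up the block"])  # language conditioning
actions = model.sample_actions(observations, task, rng=jax.random.PRNGKey(0))
```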

Experiments

The team also analyzed Octo empirically, evaluating its performance as a robot foundation model along several dimensions:

  1. Can Octo directly control multiple robot embodiments and solve tasks specified by language commands or goal images?
  2. Can Octo weights serve as high-quality initialization for efficient fine-tuning on new tasks and robots, and is it superior to training from scratch and common pre-training representations?
  3. In building a versatile robot policy, which design decisions in Octo are most crucial?

Figure 4 illustrates the evaluation of Octo across 9 tasks.

Directly control multiple robots using Octo.

The team compared the zero-shot control capabilities of Octo, RT-1-X, and RT-2-X, with results shown in Figure 5.

Octo achieves a 29% higher success rate than RT-1-X (35 million parameters). In evaluations on the WidowX and RT-1 Robot setups, Octo performs comparably to RT-2-X, which has 55 billion parameters.

Furthermore, while RT-1-X and RT-2-X support only language instructions, Octo also supports conditioning on goal images. On the WidowX tasks, the team observed that conditioning on goal images yields a 25% higher success rate than conditioning on language, likely because goal images provide more task-relevant information.

Octo efficiently adapts to new domains using data.

Table 1 presents the experimental results on data-efficient fine-tuning.

Compared to training from scratch or initializing with pre-trained VC-1 weights, fine-tuning Octo yields better results. Averaged across the 6 evaluation settings, Octo outperforms the second-best baseline by 52%!

Furthermore, it is worth mentioning that the recipe and hyperparameters used to fine-tune Octo were identical across all of these evaluation tasks, indicating that the team has found a very good default configuration.

Design decisions for versatile robot policy training

The results above indicate that Octo can indeed serve as a zero-shot multi-robot controller and as a basis for policy fine-tuning. Next, the team analyzed how different design decisions affect Octo's policy performance, focusing on model architecture, training data, training objectives, and model scale, and conducting ablation studies on each.

Table 2 presents the results of the ablation studies on model architecture, training data, and training objectives.

Figure 6 shows the impact of model size on zero-shot success rate: larger models exhibit better visual scene understanding. Overall, these results demonstrate the effectiveness of Octo's components.