
Tsinghua, Huawei, and Others Propose iVideoGPT: Specializing in Interactive World Models

Wed, May 29 2024 07:39 AM EST

Reported by Synced

In recent years, generative models have made significant progress, with video generation emerging as a new frontier. One important application of these generative video models is unsupervised learning on diverse internet-scale data to construct predictive world models. These world models are expected to accumulate commonsense knowledge about how the world operates, enabling the prediction of potential future outcomes based on the behavior of intelligent agents.

By leveraging these world models, intelligent agents based on reinforcement learning can engage in imagination, reasoning, and planning within the world model, allowing them to acquire new skills more safely and effectively in the real world through minimal experimentation.

While there is a fundamental connection between generative models and world models, a significant gap still exists between the development of generative models for video generation and world models for agent learning. One of the main challenges is achieving the optimal balance between interactivity and scalability.

In model-based reinforcement learning, world models are mostly built on recurrent network architectures. This design passes observations or latent states forward step by step, conditioned on actions, which makes interactive behavior learning possible. However, these models mostly target games or simulated environments and have limited capacity to model large-scale, complex in-the-wild data.

In contrast, internet-scale video generation models can synthesize realistic long videos that can be controlled through text descriptions or future action sequences. While such models allow for high-level, long-term planning, their trajectory-level interactivity does not provide intelligent agents with sufficient granularity to effectively learn precise behaviors as fundamental skills.

Researchers from Tsinghua University, Huawei Noah's Ark Lab, and Tianjin University have proposed iVideoGPT (Interactive VideoGPT), a scalable autoregressive Transformer framework that integrates multimodal signals (visual observations, actions, and rewards) into a sequence of tokens, allowing intelligent agents to interact with the model through next-token prediction.

iVideoGPT employs a novel compressive tokenization technique to efficiently discretize high-dimensional visual observations. Thanks to its scalable architecture, the researchers were able to pre-train iVideoGPT on millions of human and robot manipulation trajectories, establishing a versatile foundation that can serve as an interactive world model for a wide range of downstream tasks. This research advances the development of general-purpose interactive world models.

Method

In this section, the research team introduces a scalable world model architecture called iVideoGPT, which exhibits high flexibility by integrating multimodal information such as visual observations, actions, rewards, and other potential inputs.

The core of iVideoGPT consists of a compression tokenizer that discretizes video frames and an autoregressive transformer that predicts subsequent tokens. By pretraining on diverse video data, the model can acquire extensive world knowledge and then transfer it effectively to downstream tasks.

Architecture
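To make this description more concrete, here is a minimal PyTorch-style sketch of the two components: a tokenizer that discretizes each frame into a small grid of token IDs and a GPT-like transformer that models the flattened token sequence. All class names, dimensions, and the simplified codebook lookup are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch (assumptions, not the authors' code): a compression
# tokenizer that turns each frame into discrete token IDs, and a GPT-like
# transformer that models the flattened token sequence autoregressively.
import torch
import torch.nn as nn

class CompressionTokenizer(nn.Module):
    """Toy stand-in for a VQ-style tokenizer: frame -> discrete token IDs."""
    def __init__(self, vocab_size=8192, embed_dim=64):
        super().__init__()
        self.encoder = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)
        self.codebook = nn.Embedding(vocab_size, embed_dim)

    def encode(self, frames):                                    # (B, 3, H, W)
        feats = self.encoder(frames).flatten(2).transpose(1, 2)  # (B, N, D)
        flat = feats.reshape(-1, feats.size(-1))                 # (B*N, D)
        dists = torch.cdist(flat, self.codebook.weight)          # (B*N, V)
        return dists.argmin(dim=-1).view(feats.size(0), -1)      # (B, N) token IDs

class AutoregressiveTransformer(nn.Module):
    """GPT-like model over the token sequence, predicting next-token logits."""
    def __init__(self, vocab_size=8192, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids):                                # (B, T)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(token_ids.size(1))
        h = self.blocks(self.embed(token_ids), mask=causal_mask)
        return self.head(h)                                      # (B, T, vocab_size)

# Example: tokenize a batch of 64x64 frames and score candidate next tokens.
tokenizer = CompressionTokenizer()
transformer = AutoregressiveTransformer()
ids = tokenizer.encode(torch.randn(2, 3, 64, 64))                # (2, 16)
logits = transformer(ids)                                        # (2, 16, 8192)
```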

Pretraining

Large language models can acquire extensive knowledge from internet text through self-supervised learning, such as predicting the next word. Similarly, the action-free video pretraining paradigm for world models uses video prediction as the pretraining objective, providing internet-scale supervision for the physical-world knowledge that LLMs lack.

Researchers pretrained iVideoGPT on this generic objective, using a cross-entropy loss to predict subsequent video tokens.

Pre-training data. Despite the abundance of video available on the internet, computational constraints led the researchers to pre-train iVideoGPT specifically for robotic manipulation. They used a mixture of 35 datasets drawn from Open X-Embodiment (OXE) and Something-Something v2 (SSv2), totaling 1.5 million trajectories.
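To spell this objective out, the hedged sketch below shows next-token prediction with a cross-entropy loss over video tokens; the `model` interface, shapes, and names are assumptions rather than the paper's exact formulation.

```python
# Sketch of the action-free pretraining objective: given a flattened sequence
# of video tokens, predict each next token and apply a cross-entropy loss.
# `model` is any autoregressive transformer mapping (B, T) token IDs to
# (B, T, vocab_size) logits; shapes and names here are assumptions.
import torch
import torch.nn.functional as F

def next_token_loss(model, video_tokens):
    # video_tokens: (B, T) integer IDs produced by the compression tokenizer
    inputs, targets = video_tokens[:, :-1], video_tokens[:, 1:]
    logits = model(inputs)                                      # (B, T-1, vocab_size)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))

# Toy usage with a stand-in "model" that returns random logits:
vocab_size = 8192
toy_model = lambda ids: torch.randn(ids.size(0), ids.size(1), vocab_size)
tokens = torch.randint(0, vocab_size, (2, 64))                  # 2 clips, 64 tokens each
loss = next_token_loss(toy_model, tokens)
```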

Fine-tuning

Action-conditioned and reward prediction. The team's architecture is designed to flexibly integrate additional modalities for learning interactive world models, as shown in Figure 3b. Actions are integrated via a linear projection and added to the slot token embeddings. For reward prediction, rather than training a separate reward predictor, they add a linear head on the hidden state of the last token of each observation.

This multi-task learning approach enhances the model's focus on task-relevant information, thereby improving the accuracy of control task predictions. In addition to the cross-entropy loss in Equation (3), they also used mean squared error loss for reward prediction.
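The sketch below illustrates one way these two additions could look in code: actions pass through a linear projection and are added to the slot token embeddings, and a linear reward head reads the hidden state of each observation's last token, trained with a mean squared error term alongside the token cross-entropy. Module names, dimensions, and shapes are assumptions, not the authors' code.

```python
# Sketch (assumptions, not the authors' code): action conditioning via a
# linear projection added to slot token embeddings, plus a linear reward
# head on the hidden state of each observation's last token.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, action_dim, vocab_size = 256, 7, 8192

action_proj = nn.Linear(action_dim, d_model)    # action -> embedding space
reward_head = nn.Linear(d_model, 1)             # hidden state -> scalar reward

def condition_on_action(slot_token_embeddings, action):
    # slot_token_embeddings: (B, n_slots, d_model); action: (B, action_dim)
    return slot_token_embeddings + action_proj(action).unsqueeze(1)

def fine_tune_loss(token_logits, target_tokens, last_token_hidden, target_rewards):
    # token_logits: (B, T, vocab_size); target_tokens: (B, T)
    # last_token_hidden: (B, n_obs, d_model) hidden state of each observation's
    # final token; target_rewards: (B, n_obs)
    ce = F.cross_entropy(token_logits.reshape(-1, vocab_size),
                         target_tokens.reshape(-1))
    mse = F.mse_loss(reward_head(last_token_hidden).squeeze(-1), target_rewards)
    return ce + mse

# Toy usage with random tensors:
emb = condition_on_action(torch.randn(2, 4, d_model), torch.randn(2, action_dim))
loss = fine_tune_loss(torch.randn(2, 16, vocab_size),
                      torch.randint(0, vocab_size, (2, 16)),
                      torch.randn(2, 3, d_model),
                      torch.randn(2, 3))
```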

Tokenizer adaptation. The research team chose to update the entire model, including the tokenizer, to adapt to downstream tasks and found this strategy to be more effective than parameter-efficient fine-tuning methods.

There is limited literature on adapting VQGAN tokenizers to domain-specific data. In this work, because tokenization decouples dynamic information from context conditions, the team assumes that even when the model encounters unseen objects in downstream tasks, such as different types of robots, the basic physical knowledge the transformer has learned from diverse scenes, such as motion and interaction, remains shared.

This assumption was borne out in experiments: when transferring iVideoGPT from the mixed pre-training data to the unseen BAIR dataset, the pre-trained transformer could predict natural motions zero-shot, requiring only the tokenizer to be fine-tuned for the unseen robot gripper (see Figure 7). This property is particularly important for scaling GPT-like transformers to large sizes, as it enables lightweight cross-domain alignment while keeping the transformer itself intact.
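As a rough illustration of this lightweight cross-domain alignment, one could freeze the transformer and update only the tokenizer's parameters when adapting to a new robot; the function and module names below are assumptions, offered as a sketch of the idea rather than the authors' training code.

```python
# Sketch (names are assumptions): adapt only the tokenizer to a new domain,
# e.g. an unseen robot gripper, while the pre-trained transformer stays
# frozen so its learned physical knowledge is preserved.
import torch

def tokenizer_only_finetune_optimizer(tokenizer, transformer, lr=1e-4):
    for p in transformer.parameters():
        p.requires_grad_(False)                  # freeze the transformer
    return torch.optim.Adam(tokenizer.parameters(), lr=lr)
```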

Experiment

As shown in Table 1, iVideoGPT achieves competitive performance compared with SOTA methods while building interactivity and scalability into its architecture. Although the initial experiments were conducted at a low resolution of 64×64, iVideoGPT scales easily to 256×256 on RoboNet. Please refer to Figure 9 for qualitative results.

Figure 4 shows the success rates of iVideoGPT compared with baseline models. iVideoGPT significantly outperforms all baselines on two RoboDesk tasks and achieves average performance comparable to the strongest model, SVG'.

Figure 6 shows that the model-based algorithms not only improve sample efficiency over model-free algorithms but also match or exceed the performance of DreamerV3.

The researchers further analyzed the zero-shot video prediction capability of the large-scale pre-trained iVideoGPT on the unseen BAIR dataset. Interestingly, as shown in the second row of Figure 7, iVideoGPT predicted the natural motion of the robot gripper without any fine-tuning, despite differences from the pre-training data. This indicates that although the model's zero-shot generalization to completely unseen robots is limited by the lack of diversity in the pre-training data, it effectively separates scene context from motion dynamics. In contrast, with an adapted tokenizer, the non-fine-tuned transformer successfully transferred its pre-training knowledge and predicted the motion of the novel robot in the third row, achieving perceptual quality similar to that of the fully fine-tuned transformer in the fourth row, as shown by the quantitative results in Figure 8a.

For more details, please refer to the original paper.