
Renmin University's Lu Zhiwu: It's not that difficult to surpass Sora as long as we have more computing power

Heng Yu Sat, May 04 2024 06:50 AM EST

A team from Renmin University has gone head-to-head with OpenAI three times!

The first time was with CLIP, the second with GPT-4V, and the most recent with Sora:

Last May, they collaborated with UC Berkeley, the University of Hong Kong, and other institutions to publish a paper on VDT on arXiv.

At that time, the team had already proposed and adopted the Diffusion Transformer as its technical architecture. In addition, VDT introduced unified spatiotemporal mask modeling into the model.

This team is led by Professor Lu Zhiwu from the School of Artificial Intelligence at Renmin University.

Sora has now been out for more than two months. How is this domestic team progressing in video generation, and when can we expect a domestic Sora to deliver its own stunning moment?

At the China AIGC Industry Summit, Lu Zhiwu shared his thoughts on these questions without reservation.

To convey Lu Zhiwu's views faithfully, Quantum Bit has edited and organized the transcript without changing its original meaning, in the hope of offering readers more inspiration.

The China AIGC Industry Summit, organized by Quantum Bit, gathered 20 industry representatives for discussions. With nearly a thousand offline attendees and 3 million online viewers, the summit received widespread attention and coverage from mainstream media.

Key Points:

  • VDT uses Transformer as the base model to better capture long-term or irregular temporal dependencies.
  • The scaling law is a major reason video generation models shifted from U-Net-based diffusion to Transformer-based architectures.
  • VDT employs a spatiotemporal separated attention mechanism, while Sora uses a spatiotemporal unified attention mechanism.
  • VDT utilizes token concatenation for fast convergence and good results.
  • Ablation experiments reveal that model performance is directly related to the computational resources consumed; the more resources, the better the results.
  • With more computational power, surpassing Sora is not as challenging as it seems.

Below is the full text of Lu Zhiwu's speech:

Why the sudden shift to using Transformers for video generation?

In today's presentation, I will focus on our work in the field of video generation, particularly VDT (Video Diffusion Transformer).

This work was published on arXiv in May last year and has been accepted at ICLR, a top machine learning conference. Next, I will discuss the progress we have made in this field.

It is well known that Sora is outstanding, but what are its advantages? Previously, all work was based on the Diffusion Model, so why did we suddenly switch to using Transformers for video generation?

The transition from Diffusion to Transformer is due to the following reasons:

Unlike the U-net-based Diffusion model, Transformer has many advantages such as tokenization and attention mechanisms, enabling it to better capture long-term or irregular temporal dependencies. Therefore, in the video domain, many works are starting to adopt Transformer as the base model.

However, these are surface-level observations. What is the fundamental reason for using Transformers in video generation? It is the scaling law behind them.

The parameter count of a U-Net-based diffusion model is hard to scale up, whereas once a Transformer is used as the backbone, the number of parameters can be increased almost arbitrarily. With enough computing power, larger and better models can be trained; experiments show that more computation leads to better results.

Of course, video generation involves various tasks, and using Transformers can unify these tasks under one architecture.

Those were the three considerations that led us, at the time, to explore Transformers as the foundation for video generation.

Our innovation has two key points:

First, we apply Transformer to video generation, combined with the advantages of Diffusion. Second, in the modeling process, we consider unified spatiotemporal masking, placing equal importance on both time and space.

Whether it's VDT or Sora, the first step involves compressing and tokenizing the video.

The main difference from DM-based methods is that DM-based methods can only compress space, not time; whereas now, we can consider both time and space simultaneously, achieving higher compression levels.

Specifically, we need to train a 3D quantized reconstructor in spatiotemporal space, which can serve as a tokenizer to obtain patches in three-dimensional space.

In summary, through this approach, we can obtain the input for the Transformer, which is actually 3D tokens.

Once we tokenize the input video, we can model the 3D token sequence using a standard Transformer architecture, just like with a typical Transformer, without going into further detail.
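
To make this tokenization step concrete, below is a minimal sketch of turning a video into 3D spatiotemporal tokens. It assumes a video tensor of shape (batch, frames, channels, height, width) and an illustrative patch size of 2×16×16; it is not the team's released code.

```python
import torch
import torch.nn as nn

class Video3DPatchEmbed(nn.Module):
    """Cut a video into non-overlapping 3D patches and project each to a token."""
    def __init__(self, in_channels=3, embed_dim=768,
                 patch_t=2, patch_h=16, patch_w=16):
        super().__init__()
        # Stride == kernel size, so each output position is one spatiotemporal patch.
        self.proj = nn.Conv3d(in_channels, embed_dim,
                              kernel_size=(patch_t, patch_h, patch_w),
                              stride=(patch_t, patch_h, patch_w))

    def forward(self, video):                 # video: (B, T, C, H, W)
        x = video.permute(0, 2, 1, 3, 4)      # -> (B, C, T, H, W), as Conv3d expects
        x = self.proj(x)                      # -> (B, D, T', H', W')
        return x.flatten(2).transpose(1, 2)   # -> (B, T'*H'*W', D): the 3D token sequence

tokens = Video3DPatchEmbed()(torch.randn(1, 16, 3, 224, 224))
print(tokens.shape)                           # torch.Size([1, 1568, 768])
```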

What are the differences between VDT and Sora?

The most crucial part of the VDT model is the spatiotemporal Transformer Block.

One key difference from Sora is that, when designing this Block, we separated the spatial and temporal attention. A university team does not have OpenAI's computational resources, and separating the two greatly reduces the compute required; apart from this, the rest of the design is exactly the same.

Now, let's look at the differences between us and Sora.

As mentioned earlier, VDT utilizes a spatiotemporal separated attention mechanism where space and time are treated independently, serving as a compromise in scenarios with limited computational resources.

On the other hand, Sora employs a unified spatiotemporal tokenization approach, with an attention mechanism that integrates space and time. We speculate that Sora's robust physical world modeling capabilities mainly stem from this design.
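
The contrast between the two attention layouts can be sketched as follows. The token shapes, dimensions, and use of separate attention modules here are illustrative assumptions, not VDT's or Sora's actual implementation.

```python
import torch
import torch.nn as nn

D, HEADS = 512, 8

spatial_attn  = nn.MultiheadAttention(D, HEADS, batch_first=True)
temporal_attn = nn.MultiheadAttention(D, HEADS, batch_first=True)
joint_attn    = nn.MultiheadAttention(D, HEADS, batch_first=True)

def separated_block(x):                 # x: (B, T, N, D) -- spatiotemporal-separated style
    b, t, n, d = x.shape
    s = x.reshape(b * t, n, d)
    s, _ = spatial_attn(s, s, s)        # attend within each frame (spatial)
    s = s.reshape(b, t, n, d).transpose(1, 2).reshape(b * n, t, d)
    s, _ = temporal_attn(s, s, s)       # attend across frames at each spatial location
    return s.reshape(b, n, t, d).transpose(1, 2)

def unified_block(x):                   # unified space-time attention
    b, t, n, d = x.shape
    s = x.reshape(b, t * n, d)          # one long sequence over all T*N tokens
    s, _ = joint_attn(s, s, s)          # cost scales with (T*N)^2
    return s.reshape(b, t, n, d)

x = torch.randn(2, 8, 196, D)
print(separated_block(x).shape, unified_block(x).shape)
```

The point of the comparison is cost: unified attention runs over all T×N tokens at once, while the separated version attends over N spatial tokens and then T temporal tokens in turn, which is far cheaper when compute is limited.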

The input conditions differ as well, but this is not the primary distinction between VDT and Sora; essentially, both handle image-conditioned and text-conditioned video generation.

Text-to-video generation is more challenging, but the difficulty is not insurmountable, and it does not fundamentally differentiate the two models.

Next, I will introduce some aspects we explored at the time. Following the completion of the architectural design, we paid particular attention to the input conditions. Here, we have the Condition Frame represented by C, and the Noisy Frame represented by F.

We explored three ways to combine these two input conditions:

  • Through normalization;
  • Through token concatenation;
  • Through cross-attention.

Among these methods, we found that token concatenation yielded the best results. It not only led to the fastest convergence but also produced the most effective outcomes, hence why VDT adopted the token concatenation approach.
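
As a rough illustration of the token-concatenation strategy, the sketch below simply joins condition-frame tokens and noisy-frame tokens into one sequence before the Transformer; the shapes and the small encoder are placeholders, not the actual VDT configuration.

```python
import torch
import torch.nn as nn

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=2,
)

def concat_conditioning(cond_tokens, noisy_tokens):
    # cond_tokens: (B, Nc, D) from the condition frames C
    # noisy_tokens: (B, Nf, D) from the noisy frames F being denoised
    x = torch.cat([cond_tokens, noisy_tokens], dim=1)   # (B, Nc + Nf, D)
    x = encoder(x)                                       # attention mixes C and F freely
    # Only positions corresponding to noisy frames are used as the denoising output.
    return x[:, cond_tokens.shape[1]:, :]

out = concat_conditioning(torch.randn(2, 196, 512), torch.randn(2, 588, 512))
print(out.shape)   # torch.Size([2, 588, 512])
```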

We also paid close attention to a universal spatiotemporal masking mechanism. Sora has not disclosed its details, so we do not know whether it uses a similar mechanism. During training we put particular emphasis on the design of this masking mechanism, and it proved highly effective, allowing the model to handle a variety of generation tasks smoothly; we later saw that Sora can accomplish similar tasks.
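
One way such a masking mechanism can unify different generation tasks is sketched below: a per-frame mask marks which frames are given as conditions and which are to be generated. The task names and mask patterns here are illustrative assumptions.

```python
import torch

def make_frame_mask(task, num_frames=16, device="cpu"):
    mask = torch.zeros(num_frames, dtype=torch.bool, device=device)  # True = condition frame
    if task == "unconditional":
        pass                            # nothing is given, generate all frames
    elif task == "prediction":
        mask[:4] = True                 # first frames given, predict the future
    elif task == "interpolation":
        mask[0], mask[-1] = True, True  # endpoints given, fill in between
    elif task == "image_to_video":
        mask[0] = True                  # a single still frame given
    return mask

def apply_mask(clean_frames, noisy_frames, mask):
    # clean_frames, noisy_frames: (B, T, C, H, W); mask: (T,)
    m = mask.view(1, -1, 1, 1, 1)
    return torch.where(m, clean_frames, noisy_frames)  # keep clean frames where conditioned

x = torch.randn(1, 16, 3, 64, 64)
noised = x + torch.randn_like(x)
print(apply_mask(x, noised, make_frame_mask("interpolation")).shape)
```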


The ablation experiments were particularly intriguing. Both Sora and VDT face a crucial issue — the presence of numerous hyperparameters in the model. These hyperparameters are closely tied to the model and can significantly impact its performance.

Through extensive experimentation, we discovered a pattern in hyperparameter selection: if a hyperparameter increases the computational load of model training, it tends to benefit the model's performance.

What does this imply? The performance of our model is solely linked to the computational load it carries. The more computing resources required for training, the better the final generative outcomes — it's as simple as that.
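
A rough way to read this pattern is to estimate each configuration's training compute and set it against the measured quality. The sketch below uses the common 6 × parameters × tokens rule of thumb for Transformer training FLOPs; the configurations and numbers are made up for illustration, not figures from the talk.

```python
def approx_training_flops(num_params, num_tokens):
    # Rough forward-plus-backward estimate for Transformer training.
    return 6 * num_params * num_tokens

configs = {
    "small (fewer layers, larger patches)": dict(params=130e6, tokens=1e9),
    "large (more layers, smaller patches)": dict(params=500e6, tokens=4e9),
}
for name, c in configs.items():
    print(f"{name}: ~{approx_training_flops(c['params'], c['tokens']):.2e} FLOPs")
```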

This finding mirrors what was observed with DiT, the image-generation model widely regarded as the foundation of Sora.

In conclusion, ablation experiments stand as one of the most critical aspects of Sora or our work. The effectiveness of our model is directly tied to the computational resources consumed during training — the more resources utilized, the better the results.

With more computing power, surpassing Sora is not that difficult.

Considering our limited computational resources, our team undoubtedly cannot match OpenAI in terms of model training scale. Nevertheless, we have engaged in profound reflections.

The idea of simulating the physical world already appears in our paper; it is not something only OpenAI conceived. We had thought of it a year earlier.

With that foundation in place, the natural question was whether such a model can actually simulate physical laws. After training VDT on a physics dataset, we found that it handles simple physical laws well: projectile motion, uniformly accelerated motion, and collisions were all simulated quite convincingly.

So we did two particularly forward-looking things at the time. One was bringing the Diffusion Transformer into video generation; the other was realizing that this kind of model is well suited to simulating the physical world, which we validated with experiments.

Given more computational power and data, we believe we could certainly simulate more complex physical laws.
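
As a toy illustration of the kind of simple-physics data such experiments rely on (not the team's actual dataset), the sketch below renders a short projectile-motion clip as a frame sequence.

```python
import numpy as np

def projectile_clip(num_frames=16, size=64, v0=(1.5, -3.0), g=9.8, dt=0.05):
    frames = np.zeros((num_frames, size, size), dtype=np.float32)
    x, y = 5.0, 55.0                    # start near the lower-left corner
    vx, vy = v0
    for t in range(num_frames):
        xi, yi = int(round(x)), int(round(y))
        if 0 <= xi < size and 0 <= yi < size:
            frames[t, max(yi - 1, 0):yi + 2, max(xi - 1, 0):xi + 2] = 1.0  # draw the ball
        x += vx
        y += vy
        vy += g * dt                    # gravity pulls downward (image y grows downward)
    return frames

clip = projectile_clip()
print(clip.shape)   # (16, 64, 64)
```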

We have also compared our model with existing ones on specific tasks, such as portrait video generation, where a still photo is animated. We focused on this smaller task because of our limited computational resources.

The results show that VDT outperforms Stable Video Diffusion: the characters it generates blink more clearly and naturally, whereas the other model's output looks somewhat unnatural.

Moreover, predicting a face as it turns from a profile to a frontal view, or even when it is partially hidden behind a fan, remains quite challenging.

Let me briefly explain how this portrait video was created.

First, several photos of the portrait were provided. VDT turned each portrait photo into a two-second clip and then edited these clips together.
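
A minimal sketch of that clip-stitching step might look like the following, where generate_clip stands in for whatever image-to-video model produces each two-second segment; the function and shapes are hypothetical.

```python
import numpy as np

def generate_clip(photo, num_frames=48):           # placeholder for a real image-to-video model
    # Stand-in: repeat the still photo; a real model would animate it.
    return np.repeat(photo[None, ...], num_frames, axis=0)

def stitch_portrait_video(photos, fps=24, seconds_per_photo=2):
    clips = [generate_clip(p, num_frames=fps * seconds_per_photo) for p in photos]
    return np.concatenate(clips, axis=0)            # (total_frames, H, W, C)

photos = [np.random.rand(256, 256, 3) for _ in range(3)]
video = stitch_portrait_video(photos)
print(video.shape)   # (144, 256, 256, 3)
```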

Given our team's circumstances, if I built a general-purpose model, I would not surpass most models on the market. Instead, I chose a specific application point where VDT is not inferior to Sora.

After Sora emerged, many people started working on video generation. I needed to ensure that my team, even in a small aspect, remained at the forefront in this direction.

Therefore, we focused on portrait video generation, an area that overseas products such as Pika and Sora have also explored. On hyper-realistic portrait generation, VDT's results surpass those of Pika and Sora. Surpassing Sora in general video generation is harder, and our main constraint there is limited computing power.

With access to more computing power, surpassing Sora wouldn't be so difficult.

That's all I have to say, thank you everyone.