
New Achievement from Nankai University Accelerates Training of Sora Core Component by Over 10 Times

Gao Yutong, Chen Bin | Sun, Mar 31, 2024, 10:55 AM EST

At the beginning of 2024, OpenAI, the company behind ChatGPT, released Sora, its first text-to-video AI large model. Using computer vision technology to simulate dynamic changes in the real world, Sora can generate smooth, realistic 60-second videos in a single pass, marking another major breakthrough in AI after ChatGPT. However, some widely shared videos of Sora's failures show that AI still struggles to quickly and accurately "understand" the physical world.

Recently, the team of Professor Cheng Mingming from Nankai University and the Nankai International Advanced Research Institute (Shenzhen Futian) announced an internationally collaborative research result, the Masked Diffusion Transformer (MDT). Compared with DiT, the core component of Sora, MDT speeds up training by more than 10 times and once again sets the state of the art (SoTA) in both image generation quality and learning speed, achieving an FID score of 1.58 on the ImageNet benchmark and surpassing models proposed by well-known companies such as Meta and Nvidia. The research team has also released the full MDT source code.

Diffusion models, represented by DiT, one of Sora's core components, can generate high-quality images "out of thin air" and have been one of the major highlights of AI technology in recent years. However, DiT often struggles to efficiently learn the semantic relationships among the different parts of objects in an image, which leads to slow convergence during training. In addition, larger models and datasets consume enormous amounts of compute, causing training costs to soar.

"To illustrate with an example of generating an image of a dog using DiT, it learns to generate the fur texture of the dog by the 50,000th training step, then learns to generate one eye and the mouth by the 200,000th step, but misses generating the other eye. Even by the 300,000th training step, the relative position of the dog's two ears generated by DiT is not very accurate," said Cheng Mingming. "In simple terms, it's like overlooking the semantic relationships in context while reading comprehension, resulting in frequent deviations in generated images that require repeated corrections, significantly increasing training costs."

How can training costs be cut and training efficiency improved? The research team introduced contextual representation learning into diffusion training, enabling the model to use the contextual information of objects in an image to reconstruct complete information from an incomplete input image. This strengthens its understanding of the semantic relationships among the parts of an image and improves both the quality and the speed of image generation. The paper was published at the International Conference on Computer Vision (ICCV), a top-tier computer vision conference.
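To make the masking idea concrete, the following is a minimal, hypothetical sketch in Python (PyTorch): a toy transformer denoiser receives noised latent tokens with a fraction of them hidden behind a learnable mask token, and is trained to predict the noise at every position, so masked regions must be inferred from their surrounding context. All names, shapes, and the simplified noising schedule are illustrative assumptions, not the released MDT code.

import torch
import torch.nn as nn

# Illustrative sketch of masked diffusion training (assumed names, not the MDT source).

class TinyDenoiser(nn.Module):
    """A stand-in transformer denoiser operating on a sequence of latent tokens."""
    def __init__(self, dim=64, num_tokens=16, depth=2, heads=4):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, num_tokens, dim))
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))  # learnable placeholder for hidden positions
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, dim)

    def forward(self, noisy_tokens, keep_mask):
        # Replace hidden tokens with the shared mask token while keeping positional information.
        x = torch.where(keep_mask.unsqueeze(-1), noisy_tokens, self.mask_token)
        x = self.encoder(x + self.pos)
        return self.head(x)  # predict noise for every position, including the hidden ones

def masked_diffusion_step(model, clean_tokens, mask_ratio=0.3):
    """One simplified training step: add noise, hide a fraction of tokens,
    and ask the model to predict the noise at all positions."""
    b, n, _ = clean_tokens.shape
    t = torch.rand(b, 1, 1)                     # random diffusion "time" per sample
    noise = torch.randn_like(clean_tokens)
    noisy = (1 - t) * clean_tokens + t * noise  # toy linear noising schedule
    keep_mask = torch.rand(b, n) > mask_ratio   # True = token stays visible to the model
    pred = model(noisy, keep_mask)
    # The loss covers every position, so the model must infer the masked parts from context.
    return ((pred - noise) ** 2).mean()

if __name__ == "__main__":
    model = TinyDenoiser()
    tokens = torch.randn(8, 16, 64)             # stand-in for the latent tokens of an image
    loss = masked_diffusion_step(model, tokens)
    loss.backward()
    print(f"toy masked-diffusion loss: {loss.item():.4f}")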

Recently, the research team upgraded MDT to version 2 (MDTv2), introducing a more efficient macro network structure and further optimizing the learning process. They also accelerated training by adopting the faster Adan optimizer, enlarging the masking ratio, and applying other improved training strategies. Experimental results show that strengthening the semantic understanding of the physical world through visual representation learning can improve how effectively generative models simulate the physical world.
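The article gives no implementation details for these training adjustments; the sketch below only hints at how such changes might look in practice. The package name adan_pytorch, the fallback to AdamW, and all numeric values are assumptions rather than the team's actual settings.

import torch

try:
    from adan_pytorch import Adan  # assumed third-party Adan implementation; may differ from the authors' setup
except ImportError:
    Adan = None                    # fall back gracefully if the package is unavailable

def build_optimizer(model, lr=1e-4):
    """Prefer the faster Adan optimizer when available, otherwise fall back to AdamW."""
    if Adan is not None:
        return Adan(model.parameters(), lr=lr)
    return torch.optim.AdamW(model.parameters(), lr=lr)

# A larger masking ratio hides more latent tokens per step, strengthening the
# contextual-reconstruction signal described above (values are illustrative only).
MASK_RATIO_MDT = 0.30
MASK_RATIO_MDTV2 = 0.50

if __name__ == "__main__":
    model = torch.nn.Linear(8, 8)  # stand-in model
    optimizer = build_optimizer(model)
    print(type(optimizer).__name__, "selected; masking ratio:", MASK_RATIO_MDTV2)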

Cheng Mingming said, "We hope our work can inspire more research on unified representation learning and generative learning, raise the 'intelligence' of large AI models, and help solve real-world problems in more scenarios."

Related paper link: https://arxiv.org/abs/2303.14389