
Who Will Be the Chinese Version of Sora?

Sat, May 11 2024 08:12 AM EST

"Facing the challenge brought by Sora, let the bullets fly a little longer." Over two months ago, OpenAI dropped another bombshell with the release of the large-scale model Sora, sparking global attention. Galileo Capital partner Zheng Xuan commented on the gap between domestic and international large-scale video models at that time. Fast forward two months, and the "prediction" has come true. First, Shengshu Technology and Tsinghua University jointly released the video model Vidu, once dubbed as China's first Sora-level video model. Recently, there have been reports that Zhipu AI is also developing a domestic large-scale video model benchmarking Sora, set to be released within the year. With companies rushing into the field, the development of domestic large-scale video models has clearly entered an accelerated phase. However, as Zheng Xuan pointed out, the emergence of Sora is not a technical breakthrough, and in terms of engineering, the gap between domestic large models is not that far. "Essentially, perhaps the scene is more worth considering than engineering breakthroughs."

After Sora

Recently, there have been reports that Zhipu AI is developing a high-quality large-scale video model benchmarking Sora, expected to be released within the year. In response, a reporter from Beijing Business Daily contacted Zhipu AI, and they stated that the news was unofficial and they had no further information to provide.

Public information shows that Zhipu AI grew out of research results from Tsinghua University's Department of Computer Science and is one of the earliest companies in China to develop large models. In January of this year, Zhipu AI released its new-generation base model GLM-4, with CEO Zhang Peng stating that GLM-4's overall performance had improved significantly over the previous generation and was approaching that of GPT-4.

Prior to this, there had already been a wave of domestic large-scale video models. At the 2024 Zhongguancun Forum Annual Meeting on April 27, Tsinghua University and Shengshu Technology officially released China's first long-duration, high-consistency, high-dynamic video model Vidu, sparking discussions.

According to reports, Vidu is the first major breakthrough in video models globally since the release of Sora, with performance on par with top international standards.

"Vidu is the latest full-stack independent innovation achievement, achieving technological breakthroughs in multiple dimensions, including simulating the real physical world, having imagination, understanding multi-camera languages rather than simple camera zooming, generating videos up to 16 seconds with a single click, maintaining high consistency in character scene time, and understanding Chinese elements." At that time, Tsinghua University professor and Shengshu Technology's chief scientist Zhu Jun introduced.

Zhu Jun also gave an on-site demonstration comparing Vidu with Sora. For example, where Sora dropped the keyword "rotation" during video generation, Vidu captured it better and produced a smooth "rotation" of the video perspective.

However, some analysis points out that there is still a significant gap in computing power and engineering between Vidu's 16 seconds and Sora's one minute. In response, an industry insider mentioned to a reporter from Beijing Business Daily that Vidu's architecture itself is sufficient to support longer video generation, and Shengshu Technology also stated that Vidu is undergoing accelerated iteration and improvement.

It is worth mentioning that both Zhipu AI and Shengshu Technology belong to the "Tsinghua lineage." In addition, companies such as Guangyuan Beyond, Moonshot AI (Dark Side of the Moon), Baichuan Intelligence, and Mianbi Intelligence all have Tsinghua alumni behind them. Some media outlets have quoted industry insiders as saying that Tsinghua-based large model companies are laid out around Zhipu AI, with positions both upstream and downstream in the artificial intelligence industry. In March of this year, Shengshu Technology announced the completion of a new financing round worth hundreds of millions, with Zhipu AI among the investors.

Productization is Key

In fact, since the release of Sora, the domestic large video model field has been heating up. In February, the month Sora was released, Tsinghua University announced a video-related patent. In the same month, China's first AI video animation, "Qianqiu Shisong," was broadcast. The day after Vidu was released, Wanxing "Tianmu," the first domestic large model for audio and video multimedia, officially entered public testing.

According to Gartner's research forecast, by 2030, 90% of digital content will be AI-generated. It is estimated that the global AIGC market will grow from $10.8 billion in 2022 to $118.1 billion by 2032.

Economist and new finance expert Yu Fenghui told a reporter from Beijing Business Daily that the successful construction of video models means AI models can handle higher-dimensional, more complex data and engage in creative expression. This indicates that models are evolving toward understanding and creating different aspects of the world, moving closer to the cognitive and decision-making abilities pursued by AGI.

"Once video technologies like Sora mature, they have the potential to disrupt multiple industries such as media, film production, game development, virtual reality, advertising creativity, and education. They can generate high-quality video content according to user needs in a short time, significantly reducing production costs and improving efficiency," Yu Fenghui added.

In an interview with Beijing Business Daily, Zheng Xuan said that a video model can be likened to a shot-by-shot script: it uses text information to generate keyframes and then builds a continuous video frame by frame. This process involves engineering innovation more than revolutionary breakthroughs at the technical level, which suggests that the gap between domestic and international large models will not be large, with an overall time lag of no more than half a year.

Therefore, compared with engineering breakthroughs, Zheng Xuan is more concerned about application scenarios. He observed that AI short films are still very niche within the industry and are more like experimental attempts; the gap with mature commercial productions is so large that they can be "basically ignored."

The Bigger Shortage Is Inference Computing Power

As enterprises rush into the field of large video models, another crucial issue has emerged: computing power. Shortly after the release of Sora, 360 Group founder Zhou Hongyi said publicly that if Sora's technical roadmap were open-sourced, domestic players could quickly catch up. In the pursuit of Sora, however, computing power could become a bottleneck.

CITIC Securities once roughly estimated that a 60-frame video (about 6-8 seconds) requires around 60,000 patches. If the number of denoising steps is set to 20, generating the video is equivalent to generating 1.2 million tokens. Because diffusion models often need several generation attempts in practical use, the actual computational workload would far exceed 1.2 million tokens.
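The arithmetic behind that estimate can be reproduced in a few lines. The following is only a back-of-the-envelope sketch: the patch count and step count come from the CITIC Securities estimate quoted above, while the function name `token_equivalents` and the retry counts in the loop are hypothetical values chosen for illustration, not figures from the report.

```python
def token_equivalents(patches: int, denoising_steps: int, attempts: int = 1) -> int:
    """Rough workload: every patch is processed once per denoising step,
    and the whole clip is regenerated once per attempt."""
    return patches * denoising_steps * attempts


# Figures quoted from the CITIC Securities estimate.
PATCHES_PER_CLIP = 60_000   # 60-frame clip, about 6-8 seconds
DENOISING_STEPS = 20

single_pass = token_equivalents(PATCHES_PER_CLIP, DENOISING_STEPS)
print(f"one generation: {single_pass:,} token-equivalents")

# Hypothetical retry counts, to show how repeated generations scale the total.
for attempts in (1, 3, 5):
    total = token_equivalents(PATCHES_PER_CLIP, DENOISING_STEPS, attempts)
    print(f"{attempts} attempt(s): {total:,} token-equivalents")
```

The workload grows linearly with each factor, so a user who regenerates a clip three times before accepting it has already consumed roughly three times the single-pass figure.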

Angel investor and seasoned AI expert Guo Tao told a reporter from Beijing Business Daily that training large models requires handling vast amounts of data and complex computation, and that without sufficient computing power, training such models would be extremely difficult. Furthermore, global computing resources are limited and mostly concentrated in the hands of a few large tech companies, which makes it hard for other companies and research institutions to obtain enough of them.

Not long ago, Kimi, the intelligent assistant from Moonshot AI (Dark Side of the Moon), became a sensation as its user numbers surged, and the Kimi app and mini-programs temporarily malfunctioned. At that time, CITIC Jiantou released a research report stating that Kimi's continuously growing user base had led to a brief shortage of computing support. Considering subsequent model training and inference needs, the report expected demand for computing power to escalate further and to translate into actual spending on computing capacity.

"Inference computing power is likely to be the next opportunity in the venture capital circle," concluded Zheng Xuan.

By Yang Yuehan, Beijing Business Daily