
Why Wasn't Sora Born in China?

Tue, Mar 19 2024 07:38 AM EST

No sound, no show – even the best performances fall flat without it.

In the box below, type "Medieval Trumpeter," flick the sound switch, click to generate the video, and voila! A 4-second AI-generated video pops up on the screen. Not only do you see an image of a medieval court musician dressed in regal attire, but you also hear the trumpet being played.

On March 10th, Beijing time, Pika Lab, a Silicon Valley AI startup, unveiled a new feature of its in-house video generation model capable of producing both visuals and sound simultaneously. Previously, all AI-generated videos lacked audio. While this feature isn't yet available to the public, it provides a glimpse into the rapid evolution of AI technology.

On February 16th this year, OpenAI launched Sora, a large-scale model capable of generating videos based on text prompts. With just a few simple instructions, Sora can accurately "understand" text and produce videos up to 60 seconds long, attracting global attention. Some industry insiders have hailed Sora's debut as the "ChatGPT moment" in the field of video generation. On March 8th, local time, after months of behind-the-scenes drama, OpenAI's co-founder, Sam Altman, returned to the board to continue driving the company towards its mission of achieving Artificial General Intelligence (AGI).

What does Sora's sudden emergence signify? How far are we from AGI, and what direction will AI take next?

Sam Altman unveils text-generated videos on social media

Another Testament to the Power of Innovation

Before the release of Sora, OpenAI had kept its idea of entering the realm of text-generated videos under wraps. Even as the new year dawned, the limelight in the global text-generated video arena remained on startups like Pika, Runway, and Stability AI.

At the end of last November, Pika debuted its first-generation text-to-video product. Users typed keywords such as "Elon Musk in a spacesuit, 3D animation," and a cartoon version of Musk promptly appeared, a SpaceX rocket soaring into the sky behind him. The video lasted only three to four seconds, yet its clarity and smoothness far surpassed those of rival products. Reflecting on this, Pika co-founder Meng Chenlin speculated in an interview, "Perhaps GPT hasn't been used for videos because their resources and manpower are focused on text models."

Fast forward two months, and Sora dazzled the field. In a recent demo by OpenAI's chief technology officer, the prompt "a flight through a museum, enjoying various paintings, sculptures, and beautiful artworks along the way" yielded a 60-second video. Viewers follow the camera as it swoops from the air into the museum, navigating through multiple galleries and rooms and brushing past sculptures along the way.

Assistant Professor Liu Ziwei from the School of Computer Science at Nanyang Technological University in Singapore told China News Weekly that OpenAI's foray into text-generated videos wasn't surprising. OpenAI has always aimed for AGI (Artificial General Intelligence). "In the pursuit of AGI, AI needs not only to 'read ten thousand books' but also to observe various physical phenomena in the world. OpenAI will definitely expand into multimodal fields like text, images, audio, and video. Video is a crucial step in multimodal development, encompassing the fundamental laws of the world."

Liu Ziwei remains awestruck by the video effects produced by Sora. He began researching AI video generation three years ago. Compared to text and images, AI video generation presents the greatest technical challenge, requiring high resolution, smooth content flow, and consistency in video data, along with significant computational power. Before Sora, most similar products on the market suffered from low video clarity and issues like flickering frames and distorted characters. Sora's generated videos maintain excellent three-dimensional consistency. Interactions between subjects and environments, such as the movement of water and clouds or birds flying through forests, to some extent, reflect the realism of the physical world.

In the technical report on Sora published on its official website, OpenAI emphasizes the importance of the Diffusion Transformer (DiT), a model that fuses two existing approaches, and this fusion is key to Sora's success. Diffusion is a content-generation method previously shown to produce realistic, high-quality images. The Transformer is the foundational architecture of large language models such as GPT: ChatGPT's fluent responses stem from the architecture's ability to capture contextual information and generate logically consistent text by predicting the probability of the next token (the smallest unit of text).
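
As a rough illustration of how these two ideas combine, the sketch below shows a single Transformer block that denoises patch tokens while being conditioned on the diffusion timestep. It is a minimal toy, not OpenAI's actual architecture; the names, dimensions, and the simplified timestep conditioning are assumptions made for illustration only.

```python
# Minimal sketch of a Diffusion-Transformer-style block (illustrative only).
import torch
import torch.nn as nn

class TinyDiTBlock(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Timestep conditioning: a simplified stand-in for the adaptive
        # LayerNorm modulation described in the original DiT paper.
        self.time_mlp = nn.Linear(dim, 2 * dim)

    def forward(self, x, t_emb):
        # x: (batch, num_patch_tokens, dim) noisy patch tokens
        # t_emb: (batch, dim) embedding of the diffusion timestep
        shift, scale = self.time_mlp(t_emb).chunk(2, dim=-1)
        h = self.norm1(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        attn_out, _ = self.attn(h, h, h)   # tokens attend to each other, as words do in GPT
        x = x + attn_out
        x = x + self.mlp(self.norm2(x))    # feed-forward refinement
        return x                           # progressively less-noisy patch tokens
```

In a full diffusion model, a stack of such blocks is applied repeatedly to turn random noise into coherent patches: the Transformer supplies the long-range context, and diffusion supplies the iterative denoising.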

Nie Zaiqing, chief researcher at the Institute for AI Industry Research (AIR), Tsinghua University, explained to China News Weekly that a major "secret recipe" in OpenAI's video training is to split videos of different sizes and resolutions into patches (visual patches, analogous to tokens) and feed them directly into the model for learning. OpenAI's own introduction states that Sora can sample videos at widescreen 1920x1080, vertical 1080x1920, and every resolution in between. In addition, OpenAI generates descriptive captions for the training videos, which improves text fidelity and the overall quality of the generated video.
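
To make the "visual patch" idea concrete, here is a hedged sketch of how a video tensor might be cut into spacetime patches that play the role of tokens. The patch sizes and the function name are assumptions for illustration; OpenAI has not published Sora's exact patching scheme.

```python
# Illustrative spacetime patching of a video into flat "visual tokens".
import torch

def video_to_patches(video, pt=4, ph=16, pw=16):
    # video: (frames, channels, height, width); assumes all dimensions divide evenly
    f, c, h, w = video.shape
    patches = video.reshape(f // pt, pt, c, h // ph, ph, w // pw, pw)
    patches = patches.permute(0, 3, 5, 1, 2, 4, 6)   # group by (time block, row, column)
    return patches.reshape(-1, pt * c * ph * pw)     # one flattened token per patch

tokens = video_to_patches(torch.randn(16, 3, 256, 256))
print(tokens.shape)  # torch.Size([1024, 3072]): 1024 patch tokens of 3072 values each
```

Because any resolution or aspect ratio can be cut up this way, the same model can, in principle, train on widescreen, vertical, and intermediate formats alike.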

However, the industry consensus is that the DiT model is an open secret; in terms of underlying technology, Sora introduced no fundamentally new method. DiT was proposed as early as the end of 2022, when William Peebles, then a doctoral student at the University of California, Berkeley, and Saining Xie, an assistant professor of computer science at New York University, jointly published a paper that creatively fused the Transformer with diffusion for text-to-image generation, causing a stir in academia. Liu Ziwei told China News Weekly that since last year, teams around the world, including his own, have been exploring the DiT architecture for training text-to-video models. "It's a natural choice."

Screenshots of text-generated videos published on Sora's official website

At the time, there were several possible technical paths for text-to-video models, but because of limits on computing power and data, the DiT path was not yet mature, and academic teams and startups could not commit to it fully. OpenAI chose a road less traveled. According to Liu Ziwei, "the key behind Sora is not a breakthrough in the model itself, but OpenAI's victory in large-model system design." That design covers the details of the training data as well as OpenAI's accumulated strengths in computing power, talent, and organizational structure. Although these factors are crucial, OpenAI says almost nothing about them in its public materials.

Sora's success repeats the ChatGPT experience, once again vindicating the "brute-force aesthetics" of scale and OpenAI's core belief that when in doubt, you scale up the model. According to Liu Zhiyuan, associate professor of computer science at Tsinghua University and co-founder of the AI startup ModelBest (面壁智能), Sora is the "GPT-3 moment" of AI video generation: it proves the importance of data, namely that high-quality, large-scale data can train a text-to-video model.

Dong Chao, a researcher at the digital institute of the Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, has long worked on low-level computer vision and is now building a multimodal model with his team. He stresses that which data are selected, and how they are filtered and labeled, directly determine the quality of what the model generates. For a large model to generate high-quality video, the training data must be high-resolution, rich in scene detail, and well proportioned across people, objects, and background, while clips whose scenes change too quickly must be excluded.
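
As a purely hypothetical illustration of such curation rules, the sketch below keeps only clips that meet a minimum resolution and do not cut between scenes too quickly. The thresholds and metadata fields are invented for the example; this is not a published pipeline.

```python
# Toy data-filtering rules for a video training set (assumed thresholds).
MIN_HEIGHT = 720          # assumed minimum frame height
MAX_CUTS_PER_MINUTE = 10  # assumed ceiling on scene changes

def keep_clip(clip):
    # clip: precomputed metadata, e.g. {"height": 1080, "duration_s": 30.0, "scene_cuts": 3}
    if clip["height"] < MIN_HEIGHT:
        return False
    cuts_per_minute = clip["scene_cuts"] / (clip["duration_s"] / 60.0)
    return cuts_per_minute <= MAX_CUTS_PER_MINUTE

clips = [
    {"height": 1080, "duration_s": 30.0, "scene_cuts": 3},  # kept
    {"height": 480,  "duration_s": 60.0, "scene_cuts": 2},  # dropped: too low-resolution
    {"height": 1080, "duration_s": 12.0, "scene_cuts": 9},  # dropped: scenes change too fast
]
print([keep_clip(c) for c in clips])  # [True, False, False]
```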

Pika co-founder Meng Chenlin also noted that some films contain many beautifully shot scenes, but if most of the footage is people standing and talking with little movement, it is not high-quality data for training a large model. Copyright issues may also keep companies from collecting enough high-quality video.

After data, Dong Chao believes, the talent of the team is decisive: "Training a large model is by no means simple. Without experience you simply cannot tune it, and the job usually falls to the best people on the team. At many foreign tech companies, the top AI researchers handle the data and write the code themselves."

According to OpenAI's website, Sora's core team has 15 members. Public information shows the team was formed less than a year ago. Of its three research leads, two graduated from the University of California, Berkeley in 2023: one is William Peebles, a co-author of the DiT paper mentioned above; the other, Tim Brooks, worked at Google for nearly two years and focused on image and video generation during his doctorate at Berkeley. Brooks and the third research lead, Aditya Ramesh, were both creators of DALL-E 3, OpenAI's text-to-image model.

From GPT-3 and GPT-3.5 to GPT-4, OpenAI has accumulated deep capabilities in large-scale data training, generation, and governance, which gave Sora its infrastructure. Liu Ziwei told China News Weekly: "The Sora team has only a dozen or so people, which shows that OpenAI gave them crucial underlying support in organizational structure, talent management, and infrastructure. Only then could people with ideas actually produce results that influence the world."

Is AGI Arriving Faster Than Expected?

Still, today's Sora is far from perfect. In the generated videos published on OpenAI's website, Sora can produce hallucinations that defy common sense: a generated chair may warp, or water spills onto the table before the glass has shattered, plainly violating the laws of physics. In its technical report, OpenAI writes that Sora may struggle to accurately simulate the physics of complex scenes, may fail to understand cause and effect, may confuse left and right, and may have difficulty precisely describing events that unfold over time.

This resembles ChatGPT's straight-faced nonsense. Sun Maosong, executive vice dean of the Institute for Artificial Intelligence at Tsinghua University and head of the Natural Language Processing and Social Humanities Computing Laboratory in the Department of Computer Science, explains that this is the "soft spot" of Transformer-based models. Scientists once hoped AI could perform deductive reasoning the way humans do, but years of effort have not achieved it. The Transformer's success gave AI astonishing generative power, yet it has another side: it does not think as humans do, and it can hallucinate.

Sun Maosong believes another current weakness of Sora is its limited controllability, for example when it is asked to generate a complex scene.

Text-generated video shown on the Pika Lab website (screenshot)

"At the current level of generation, extending the duration from one minute to five minutes only requires more computing power," says Sun Maosong. "Essentially, the model just keeps predicting the next frame." Achieving precise control over generated videos, however, demands more than computing power; it requires greater algorithmic sophistication, which may take several years of progress. If that challenge is overcome, it would be a breakthrough beyond ChatGPT.

Sora Sparks Industry Sensation as OpenAI Labels it "World Simulator"

The excitement surrounding Sora stems from OpenAI's description of it as a "world simulator." According to OpenAI, after training on massive datasets, Sora has exhibited new capabilities to simulate certain aspects of people, animals, and environments in the physical world. For instance, when generating a scene of someone eating a hamburger, Sora not only depicts the act of eating but also leaves bite marks on the burger. These emergent capabilities arise without explicit data labeling. OpenAI believes that continuing to scale video models is a promising path toward building high-performance simulators of both the physical and the digital world.

Debate Over Sora's Status as a World Simulator

Liu Ziwei explains that although OpenAI ties the idea of a world simulator to achieving AGI, whether Sora truly qualifies as one remains controversial. Jim Fan, Chief Research Scientist at NVIDIA AI Research Institute, asserts that "Sora can simulate countless real or fictional worlds." Turing Award laureate and Meta Chief AI Scientist Yann LeCun counters that "modeling the world through pixel generation is a waste... bound to fail." Lin Dahua, leading scientist at the Shanghai Artificial Intelligence Laboratory, acknowledges Sora as a milestone in video generation but stresses the vast gap between generating lifelike videos and mastering physical laws, let alone achieving AGI. In his view, the deeper one probes GPT-4, the clearer it becomes that humanity is still far from AGI.

The Lack of Consensus on World Simulation and AGI

Currently, neither academia nor industry agrees on what counts as a world simulator. The deeper disagreement lies in how to define AGI. Scientists such as Yann LeCun argue that AI should systematically understand how the human world works, rather than be a super machine that merely absorbs vast amounts of human knowledge. Proponents of OpenAI's approach counter that AI need not understand the underlying physical laws; it only needs to keep predicting the next frame accurately enough to reproduce the world's changes, and that this can carry humanity toward AGI.

Text-generated video shown on the Runway website (screenshot)

What Is AGI?

At this year's Two Sessions, Zhu Songchun, director of the Beijing Institute for General Artificial Intelligence, offered his definition of AGI: an agent that can complete an unlimited range of tasks in everyday physical and social settings and can discover tasks on its own, one that "sees what needs doing" and is driven by its own values. At the end of January, the institute exhibited in Beijing the world's first prototype of a general intelligent agent, a little girl named "Tong Tong." Zhu says "Tong Tong" has the full mind and value system of a three- or four-year-old child and is still iterating rapidly. Behind the most ordinary everyday abilities, he argues, lie the core technical problems of AGI. "The key to achieving general artificial intelligence is to give machines a heart and a mind of their own."

Sora's Positioning and Its Challenges

Sora reflects the physical laws of the real world to a degree, but it does not yet build a rational model of the world. Liu Zhiyuan points out that human understanding of the world also develops in stages, roughly analogous to before and after formal schooling. Sora has language ability and perceptual knowledge, but whether simply scaling the model further can simulate the world is a long-term question. Scientists need to explore other approaches that let large models understand the world rationally.

Outlook and Challenges

In the second half of 2022, Sun Maosong predicted that multimodal large models would see a breakthrough in 2024, text-to-video models in particular. Multimodal technology will logically move toward video generation, but where AI's next breakthrough will come remains uncertain.

AI Safety and Deepfakes

Sora's arrival has deepened worries about deepfakes. The barrier to generating video with AI keeps falling, and real and fake are becoming hard to distinguish. Research teams have worked with institutions on deepfake detection, but as the technology advances, generated video quality keeps improving. Society needs greater awareness of AI safety, and stronger safeguards should be built in at the design stage.

OpenAI, Regulation, and Prospects

ChatGPT's release set off a global debate over regulating generative AI, so OpenAI has grown more cautious. When designing large models, its engineers work with "red team" testers on adversarial evaluations to uncover potential dangers and avenues for misuse.

The China-US Gap and Competition in AI

A gap remains between AI development in China and the United States, and China faces challenges such as computing power. Compared with a decade ago, however, China's gap with the US in AI talent and research output has narrowed. Other countries and US tech companies are also chasing OpenAI.

The Challenge of Replicating Sora

Replicating Sora is not simple, because the model is only the tip of the iceberg. The key is to gather smart people and let each of them use their talents; reaching even an 80 percent replication might take about a year. Why, then, was Sora not born in China? According to Dong Chao, it comes down primarily to the talent gap. The newly minted Ph.D.s on the Sora team have frontline experience training large models like GPT, whereas in China people of that caliber often end up leading teams of dozens, leaving little room to stay hands-on. Second, OpenAI enjoys a large per-capita advantage in computing resources. With a staff of more than 700, even small internal teams can use thousands of GPUs to try out different ideas, backed by OpenAI's ample patience. In February this year, The Wall Street Journal reported that OpenAI plans to raise as much as $5 trillion to $7 trillion and move into chip-making to secure enough computing power for GPT's development.

Computational resources in China, by contrast, are scarce. If a team manages to secure 1,000 GPUs, that alone consumes a large share of available resources, so whatever it builds faces intense outside scrutiny. If no tangible results appear within three to six months of training the first model, the resources are likely to be reallocated, which makes it hard for researchers to take risks and pursue genuinely innovative work.

On February 21st, Google unveiled Gemma, its next-generation open-source model.

Dong Chao also noted that taking the right path often involves great risk and long cycles, which most teams are reluctant to accept. "The text-to-video model is a typical example. OpenAI pursued a pure text-to-video model, trained it from scratch, collected a large amount of data, and got results only after nearly a year of attempts. Once it succeeds, it is bound to be disruptive." By contrast, the research atmosphere in China is impatient: hoping to overtake foreign counterparts in three to five months leads only to patching and copying others' work, breeds burnout, and makes it hard to build real technological barriers.

At the end of 2022, after ChatGPT took off, hundreds of large-model companies sprang up in China trying to build a Chinese ChatGPT. A year later, however, Chinese companies had still not caught up with GPT-4 in large language models. According to Liu Zhiyuan, if investors or practitioners dazzled by Sora see only the surface and rush to build a Chinese Sora, that treats the symptom rather than the cause. If China merely follows OpenAI's innovations at the level of business models without sustained investment in the underlying technology, it will never produce its own GPT-4 or Sora. "Even when we replicate, we must catch up in the right direction," Liu said.

In Dong Chao's view, we should neither overestimate Sora's significance nor underestimate OpenAI's technical reserves, and we should pay more attention to the logic behind how Sora was built. If we fixate on Sora itself, OpenAI is very likely to drop another "bomb" a year from now.

Surpassing OpenAI will not be easy. Since OpenAI restructured as a capped-profit company in 2019, it has abandoned its open-source strategy: GPT-3, GPT-3.5, and GPT-4 are not open-source, and even the model parameters are not disclosed, which led Elon Musk to mock the company as "ClosedAI." At the end of February this year, Musk, a former OpenAI board member, sued OpenAI, its CEO, and its president, accusing the company of betraying its founding mission and demanding that it return to open source and pay damages. OpenAI responded that as large models grow more capable, open-sourcing them would let bad actors use large amounts of hardware to build unsafe AI, so reducing openness is justified.

Whether large models should be open-source has stirred huge controversy at home and abroad. Supporters argue that AI development depends on open source and the developer community: researchers around the world can keep contributing code, help solve problems, build more transparent AI, and resist the monopoly of large companies. When OpenAI was founded, it too was a staunch supporter of open source. The closed-source path, however, lets a company concentrate its resources and keep improving through iteration on internal user data.

The future development direction of AI is a topic of global concern.

Since last year, AI companies such as Meta and the French startup Mistral have released open-source large models one after another. On February 21st, Google released Gemma, a new generation of open-source models billed as "the world's most powerful lightweight" models, in what looks like a challenge to OpenAI. The widely acknowledged reality, however, is that today's open-source models still cannot match closed-source ones, and some practitioners even argue they never will. In Liu Ziwei's view, open-source large models have significant value: like the electric grid, they provide infrastructure that lets more developers push back against the monopoly of big tech companies. He expects open-source models to keep improving; even if they never reach the overall level of closed-source models, they may surpass them in certain specialized capabilities.

Several interviewees noted that, compared with the United States, China's advantage lies in the diversity of its commercial application scenarios. Some domestic large-model makers are better at thinking about how to serve users, but companies still need to cultivate the "internal strength" of self-developed large models. Along the current route of brute-force scaling, OpenAI's "technological explosion" will not last forever; its first-mover advantage does not mean it cannot be caught. If the infrastructure is laid down step by step, the gap will gradually narrow.

In a discussion on technological innovation in 2023, Zhu Songchun observed that continuing along the old path of "catching up, keeping pace, leading" produces a "basketball game" model of research: the ball stands for the technological hotspot, and it is always controlled by the technologically strongest player. Our team ends up chasing the ball all over the court; we not only lose our own focus, but the constant changes of direction and technology scatter the team. More importantly, the side controlling the ball has already laid out the software and hardware ecosystem, leaving emerging industries vulnerable to being choked off at critical points.

Zhu believes we should abandon the tactics of "playing basketball" and learn to "play chess," keeping the whole board in view. Rather than blindly chasing the current AI hotspots of big data, big computing power, and big models, we should shift from a defensive strategy preoccupied with "shoring up weaknesses" to one that also goes on the offensive by "building strengths," exploring an innovative path of our own.

Published in the 1132nd issue of "China News Weekly" magazine on March 18, 2024.

Magazine Title: "Where will Sora take AI?"

Reporter: Yang Zhijie

Editor: Du Wei