
Is Sora an Opportunity or a Challenge for Chinese Tech Giants?

蓝鲸财经 Tue, Feb 27 2024 01:20 AM EST

Setting the technology aside and looking at practical application, do domestic large-model companies have the same "opportunity for a breakthrough" in video generation as Sora?

In the first month of the Year of the Dragon, OpenAI dropped another early-year bombshell, just as it did with ChatGPT last year: Sora, a text-to-video model.

Facing such AI generation capabilities, practitioners of almost every kind have felt the shock. An IT professional who is also a film producer told Lujiu Business Review that Sora's impressive performance has created a sense of crisis among his peers: with film production costs falling sharply, it will be easier than ever for new filmmakers to emerge.

Yet when Lujiu Business Review asked whether Sora is actually ready for commercialization, and how the higher computing demands of text-to-video generation would be met, the producer's answer boiled down to "problems of development will be solved by development."

That answer seems overly optimistic. More professionals believe that, for all its technical achievements, Sora still has too many immature aspects to reach the stage of industrialized commercialization.

So, setting technique aside and looking at practical application: do the domestic large-model companies that have already moved into text-to-video have the same "opportunity for a breakthrough"? And what substantive leap does text-to-video represent over the earlier text-only models? These are interesting questions.

Sora, revolution or hype?

It must be acknowledged that Sora's arrival brings general artificial intelligence (AGI) one step closer, because it can simulate the real physical world, including how objects move and interact.

That improvement alone, however, would not be called "amazing." According to OpenAI's technical report, Sora's "revolutionary" aspects mainly include the following:

First, duration. As a general text-to-video model, it can generate videos up to 60 seconds long from a user's text description. Not only is the video quality excellent, the output also follows the user's prompt more faithfully.

Second, a breakthrough in scene complexity and character generation. Sora can already produce scenes with multiple characters, specific types of motion, and intricate background detail. Its shot language has also grown more sophisticated, giving the videos a degree of narrative function, which is exactly what the short-video industry needs.

Third, beyond text-to-video, Sora can also animate still images or generate new videos from existing ones, filling in missing frames or extending a clip.

A veteran technology journalist told Lujiu Business Review that AI products like Sora create an opportunity for "equality of ideas." Journalists who have covered the industry for years often have bold ideas but lack the tools to bring them to life. With tools like GPT and Sora, a journalist who spots an opportunity and has an idea can now potentially build the product; all that remains is to verify its feasibility.

However, in conversations with multiple industry insiders, Lujiu Business Review found that even Sora, currently in the limelight, may be overestimated.

Li Mingshun, chairman of Xingxing AI, takes a more measured view. In his opinion, Sora is largely a stage of technological iteration, extending the general text model into the video domain. Its dramatic improvement also owes much to nearly unlimited spending on computing power and funding, plus repeated training on massive datasets: a "miracle achieved through brute force."

More than its technical edge, it is this superior "resource endowment" that puts Sora far ahead of the many domestic vendors still short of computing power. That is a gap domestic large-model vendors will find hard to close for a considerable time.

From an investment perspective, vertical "general models" like Sora are not considered hot targets.

A primary-market investor told Lujiu Business Review that pure primary-market investing generally steers clear of overly big concepts and overly high valuations. A primary-market fund's life cycle is typically seven years, with a two-year investment period, so an exit within five years is the expected outcome. Yet whether a vertical text-to-video model can achieve industrialized commercialization within five years, no one can say for certain.

In addition, little is known about Sora so far: only a technical report, released on February 15, yet news of a financing round emerged three days later. In that round, led by Thrive Capital and closed before Sora was openly usable or its real capabilities were known to outsiders, OpenAI's valuation approached $80 billion. The same investor admitted frankly to Lujiu Business Review that the release was most likely part of OpenAI's "valuation management."

Zhou Yahui, chairman of Kunlun Wanwei, wrote on his WeChat Moments: "Scientists and engineers here (in Silicon Valley) do not recognize the equity value of any startup except OpenAI; they regard it as paper wealth. They would rather take a $1 million package (half in stock) from OpenAI, Google, Facebook, or Microsoft than a $3 million package (80% in stock) from a startup."

Evidently, with Sora, OpenAI has widened its lead over the other AI giants even further.

Chinese large models: risks and opportunities for domestic vendors

Although Meta, Google, and Microsoft are also eager to move, domestic large-model vendors look far more cautious than the capital market's frenzy over Sora would suggest. Most Chinese vendors still build large models around their own applications rather than pursuing so-called AI-native model upgrades.

ByteDance is one of them; its conservative attitude toward generative AI has been evident since the text-model stage. Yet in terms of timing, ByteDance was not late. According to a report by Wan Dian, after OpenAI released GPT-3 in June 2020, ByteDance trained a generative language model with tens of billions of parameters. Had that work continued on a normal trajectory, ByteDance would not have been far behind OpenAI's GPT by 2023. But in an ROI-driven culture, investment in generative AI clearly did not pay off quickly, so ByteDance ended up exploring the field more slowly than its competitors.

In terms of release timelines, Baidu's Wenxin Yiyan launched in March 2023 and iterated to version 4.0 in October of the same year. Alibaba's Tongyi Qianwen and Tencent's Hunyuan Assistant followed closely. ByteDance's Yunque model was not released until August 2023.

One consequence of being a latecomer is a smaller user base: Wenxin Yiyan surpassed 100 million monthly active users last year, while ByteDance's Doubao is still under 10 million. With Zhang Nan now heading Jianying, however, ByteDance is expected to move faster in generative AI.

While ByteDance may lack immediately usable products in generative video, Baidu and Alibaba do not. At last year's Baidu World Conference, Baidu demonstrated Wenxin Yiyan's video-generation capabilities, delivered mainly through the "Yijingliuying" plugin.

Of course, the videos shown at the conference were the successful picks from many attempts with Yijingliuying. Lujiu Business Review's own tests revealed several limitations.

The first is the material library. Yijingliuying currently draws on a copyright-free stock library, which makes it unsuitable for industrial, brand-specific commercial work.

The second is that it cannot generate videos containing human faces, owing to potential portrait-rights issues; likewise, it can only be used for videos that do not involve trademarked products.

The third is that the generated clips run only about 30 seconds. To approach the length Sora offers, two clips must be stitched together, and it then becomes hard to keep content and style consistent.

Among Tongyi Qianwen's related technologies, the most used and most popular is the image-to-dance-video feature "Quanmin Wuwang" (Everyone's Dance King): given a single full-body photo, it can animate the subject performing popular dance moves. On Bilibili, creative videos of historical figures such as Empress Dowager Cixi dancing the viral "Subject Three" routine have drawn views estimated in the tens of millions.

Although this has not reached industrial standards or caught up with Sora abroad, Sora itself has not achieved industrialization either, so in commercialization terms the two are not yet far apart. What remains is to keep catching up.

Li Mingshun of Xingxing AI holds a similar view. He told Lujiu Business Review that OpenAI still leads the industry, a position built largely on its earlier accumulation of computing power and technology, and that domestic general-model vendors such as BAT and ByteDance will keep chasing. The reason is simple: a general-purpose large model has become a badge of baseline capability for internet companies.

The competition seems to have just begun.

Where does the real advantage of generative videos lie?

Of course, both OpenAI's Sora and various domestic model manufacturers ultimately aim to achieve industrialized and streamlined production of high-quality video content.

For now, however, even Sora, powerful as it is, has too many immature factors to be applied in industrial settings. The product architect of ZhiXingYuan's AI dynamic-video solution (www.creatlyai.cn) told Lujiu Business Review that although Sora looks convenient, turning a few prompt words directly into high-quality video with minimal mental and operational burden on the user, its understanding of the real physical world remains limited, which causes problems in certain scenes. Details such as candlelight pointing the wrong way, incorrect object counts, and objects entering or leaving the frame with distorted spatial logic are hard to fix in post-production.

There are workarounds, however. Since Sora offers video extension and concatenation, users can generate several short clips and then assemble them in post-production. For those without a grasp of prompt engineering, repeated generation and manual post-production are unavoidable.
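As a rough illustration of that post-production step, the sketch below stitches several hypothetical generated clips into one longer video using the widely available ffmpeg concat demuxer. It is not tied to Sora's own tooling, which has no public interface; the clip file names are placeholders, and ffmpeg is assumed to be installed.

```python
import subprocess
import tempfile
from pathlib import Path


def concat_clips(clip_paths, output_path):
    """Concatenate short generated clips into one longer video.

    Uses ffmpeg's concat demuxer with stream copy, which assumes the clips
    share the same codec, resolution, and frame rate (typical when they come
    from the same generator with identical settings).
    """
    # Write the list file that the concat demuxer expects.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        for clip in clip_paths:
            f.write(f"file '{Path(clip).resolve()}'\n")
        list_file = f.name

    # Stream copy (-c copy) avoids re-encoding, so the clips themselves
    # are left untouched.
    subprocess.run(
        ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
         "-i", list_file, "-c", "copy", str(output_path)],
        check=True,
    )


if __name__ == "__main__":
    # Hypothetical clips produced by separate text-to-video generations.
    concat_clips(["clip_01.mp4", "clip_02.mp4", "clip_03.mp4"], "combined.mp4")
```

Note that stitching of this kind does nothing to reconcile differences between generations, which is exactly the content- and style-consistency problem described above.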

Moreover, industrial promotional videos usually center on a client's newly launched product, a new down jacket, car, or phone, which by definition does not exist in the video model's training data. The only option is to generate a similar-looking product and then fix it in post-production.

Professional and casual users also make different demands. For casual users with no commercial needs, the model is just something to try, and every generated piece is a pleasant surprise. For professionals such as directors, though, an unsatisfactory first pass means repeated generation and post-production, a burden in both computing power and labor.

A film post-production practitioner notes that the biggest cost in film production lies in editing and special effects, that is, in secondary processing. If the upstream workflow is not good enough, costs pile up in post-production and eat into the project's ROI.

If today's generated video still needs heavy manual correction and cannot reproduce real-world scenes one to one, the cost-effectiveness of using AI-generated footage is low.

On that basis, a post-production professional believes AI can directly replace mid-stage work such as set construction and shooting, because with continued training its simulation and rendering of the physical world can approach lifelike levels.

These are only projections of what Sora might change in film and television. In gaming, advertising, short-video creation, and other specific fields, the transformation will surely be larger still. The revolution AI brings will undoubtedly be sweeping, and domestic giants are clearly more willing to put effort and experimentation into the commercial exploration of AI applications.

According to a teaser Zhou Yahui posted on WeChat Moments, "OpenAI will soon release GPT-4.5, and it will probably deliberately time the release to coincide with Anthropic's release of Claude 3." Beyond Sora's video generation, OpenAI's next iteration should be what the large-model strategy and business units of the domestic giants find most intriguing. On the direction of text-to-video, the large American and Chinese companies have already made different choices: the American giants lean toward using AI to expand the applications built on large models, while the Chinese giants lean toward using AI to train and upgrade their own native large models.