
About Sora: Known and Unknown

晚点LatePost Mon, Feb 26 2024 01:12 AM EST

On February 16, OpenAI unveiled a series of 60-second videos produced by artificial intelligence, showcasing its text-to-video tool, Sora, for the first time. The name, which means "sky" in Japanese, is meant to symbolize limitless creative potential.

Text-to-video AI tools are not entirely new. Prior offerings include Runway's Gen-1 and Gen-2, Google's Imagen Video and Phenaki, and Meta's Make-A-Video, among others.

Previous tools often generated videos frame by frame, producing individual images that were then stitched together. This approach had a built-in limitation: even when every frame shared the same keywords, the results could differ dramatically from one frame to the next. Video length therefore had to be kept strictly short to avoid problems such as characters changing appearance or other discontinuities.

Sora's primary advantage over previous tools lies in its significant breakthrough in video length and coherence. According to OpenAI's technical report and expert analysis, Sora uses a "spacetime patch" technique: it parses the text request and splits the video to be generated into smaller patches, each carrying spatial and temporal information, before generating them.

Schematic diagram of the "spacetime patch" technique from OpenAI's technical report.
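To make the "spacetime patch" idea concrete, here is a minimal, hypothetical sketch of how a video tensor could be cut into patches that each span a few frames and a small spatial region, with each patch becoming one token for a model to process. The patch sizes and tensor layout are illustrative assumptions, not parameters OpenAI has disclosed.

```python
# Illustrative sketch only: cut a video into "spacetime patches", i.e. small
# blocks covering a few frames and a small spatial region, one token each.
# Patch sizes and layout are assumptions, not OpenAI's actual settings.
import numpy as np

def video_to_spacetime_patches(video, pt=4, ph=16, pw=16):
    """Split a video of shape (T, H, W, C) into flattened spacetime patches.

    Returns an array of shape (num_patches, pt * ph * pw * C): one token per
    patch, carrying both spatial and temporal information.
    """
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0, "dims must divide evenly"
    # Arrange the video into a grid of (T/pt, H/ph, W/pw) blocks of size (pt, ph, pw, C).
    patches = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    patches = patches.transpose(0, 2, 4, 1, 3, 5, 6)  # group block indices first
    return patches.reshape(-1, pt * ph * pw * C)

# Example: a 16-frame, 128x128 RGB clip becomes 4 * 8 * 8 = 256 tokens.
clip = np.random.rand(16, 128, 128, 3).astype(np.float32)
tokens = video_to_spacetime_patches(clip)
print(tokens.shape)  # (256, 3072)
```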

This lets Sora maintain consistency across a video in a more fine-grained way and greatly enriches the detail within it. In the demo videos OpenAI released, this coherence translates into capabilities earlier tools lacked: better simulation of simple interactions between characters and their environment, extending a video forward or backward in time, and blending two videos into one coherent clip.

Furthermore, Sora excels at physical modeling and composition. Unlike previous tools, which uniformly crop input footage to a fixed format, Sora can generate video directly at the original aspect ratio and resolution. This helps it grasp the main subject of a video and simulate the movement of the same object from different angles.

A frame from one of OpenAI's demo videos shows bustling Tokyo covered in beautiful snow. The accompanying prompt reads: "Beautiful snowy Tokyo is bustling. The camera glides through the crowded city streets, following a few people enjoying the beautiful snowy weather and shopping at nearby stalls. Beautiful cherry blossom petals dance in the wind along with the snowflakes."
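As a rough illustration of the contrast described above, the hypothetical snippet below compares forcing every frame into a fixed square crop with tokenizing a frame at its native resolution; the sizes are assumptions for illustration only, not Sora's actual values.

```python
# Sketch of the difference described in the article: older pipelines often
# resized/cropped every clip to one fixed square, while a patch-based approach
# can tokenize each frame at its native aspect ratio and resolution.
def fixed_crop_shape(height, width, side=256):
    """Old-style preprocessing: every frame is forced into a side x side square."""
    return (side, side)

def native_patch_grid(height, width, ph=16, pw=16):
    """Patch grid for a frame kept at its original resolution and aspect ratio."""
    return (height // ph, width // pw)

print(fixed_crop_shape(720, 1280))   # (256, 256) -- the 16:9 composition is lost
print(native_patch_grid(720, 1280))  # (45, 80)   -- the 16:9 framing is preserved
```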

However, amid the awe at its capabilities, much remains unknown. For instance, it is unclear whether Sora supports languages other than English, and there is no indication of when it will be opened to more users. For now, only a small group of "visual artists, designers, and filmmakers," along with designated safety testers, has been granted access.

The technical report on OpenAI's website gives only a brief outline of the system's general principles, noting that it draws on earlier technologies such as GPT and DALL·E 3 for text analysis; unlike the GPT-3 paper, however, it discloses neither the training dataset nor the model architecture.

Saining Xie, a professor at New York University, pointed out that Sora likely builds on the diffusion transformer (DiT) architecture he developed with another researcher, while some have claimed that Sora uses Unreal Engine 5 to create part of its training data. OpenAI has consistently declined to disclose how many videos the system learned from or where they came from, saying only that training included both publicly available videos and videos licensed from their copyright owners.

Such secrecy has become a standard move lately when major companies release new large models. On the same day Sora was unveiled, Google rolled out Gemini 1.5, likewise available only as a limited preview to a small group of developers and enterprise customers. An analysis of ten major AI models by Stanford's Center for Research on Foundation Models found that none of the major model developers provides sufficient transparency.

OpenAI's explanation for not yet releasing the tool or further details is the need to reduce misinformation, hate speech, and bias in generated videos; all generated videos carry a watermark, although the watermark can be removed. Given that short videos can already sway politics in a significant way, regulatory pressure on the field of artificial intelligence is unprecedentedly high. (Intern Shang Yi)