The Final Battle! Exclusive Interview with OpenAI's Sora: AI Video Models Still in the Era of GPT-1

Thu, May 02 2024 07:57 AM EST

New Wisdom Times Report

Editor: Alan

[New Wisdom Times Overview] When Sora emerges, who can compete! Recently, the three leaders of the Sora team, Aditya Ramesh, Tim Brooks, and Bill Peebles, were interviewed to explain the changes that Sora brings in areas such as simulating reality, predicting outcomes, and enriching human experiences.

In the field of video generation, the unanimous opinion is: When Sora appears, who can compete!

However, how do members of the Sora team, who are at the forefront, view this? Recently, Sora's three leaders, Aditya, Tim, and Bill, were interviewed.

The result? Quite steady!

After watching the entire interview video, you'll notice that besides being young and promising, the team is remarkably steady in its thinking and planning.

So steady that there is practically no plan at all.

Steady as if they know they will win anyway, or as if they don't much care about winning and are simply focused on improving the model step by step.

Perhaps this is OpenAI's corporate culture: everyone is free to challenge, and if someone overtakes us on the leaderboard, we'll simply release an update and reclaim the throne.

P.S. Readers who are not yet familiar with these three experts and the other team members can refer to this earlier issue.

Regarding the entire interview video, the editor has summarized it into four points:

Simulating reality leads to AGI

AGI represents a hopeful future, but with Sora, this potential goes beyond mere imagination.

Sora bridges the gap between current AI capabilities and artificial general intelligence (AGI) by simulating complex environments within neural networks. As Sora develops, it will be able to comprehend our three-dimensional world more completely, a leap towards more sophisticated artificial intelligence systems.

Enriching Human Experience

Sora has become a medium for creativity, allowing users to create innovative art and narratives.

Simultaneously, Sora's exploration enhances traditional forms of content creation, offering a new dimension for storytelling and sharing experiences. In the future, content provided in various fields from entertainment to education will be more immersive and interactive.

The three experts also discussed Sora's technological foundation on-site, including aspects such as digital modeling, physics engines, and video generation.

Furthermore, in terms of practical deployment and optimization, accessibility and affordability need to be considered so that Sora's capabilities can reach a wide audience without compromising quality and effectiveness.

Values

Safety is something that must never be overlooked along the way.

Misinformation and the misuse of AI-generated content in particular call for both technical measures and appropriate guidelines and regulations.

The trio's reassurance: there is no rush. Sora is currently gathering feedback from artists and ethicists to ensure it aligns with societal values and safety standards.

Simulate everything, until AGI

The team believes that Sora is truly on the critical path to AGI.

For example, let's revisit some of the stunning scenes that Sora has brought us: Winter in Tokyo, a crowd gathers. People converse, hold hands, while others sell goods at nearby stalls.

This scene, with its myriad complexities, aptly illustrates how neural networks can simulate highly intricate environments and worlds within the scope of their weights, predicting future behaviors. To create truly realistic videos, the model must learn how people work, interact with others, and think.

Not just people, but also animals and any other object you want to model.

And as Sora continues to scale, it may become the next hot concept: the world model. Anyone could interact with this "world simulator." Each person could have their own simulator and, at any time, experience simulated events or simulate life (or simulate love?).

In this way, humanity will help the model step by step towards that magnificent endpoint.

"This will happen."

How Sora influences the world

Explore creative potential, enrich human experience

The world model is in the not-so-distant future, while some experiences are happening right now, right around us.

When Sora was first unveiled, many people were drawn in by the beautiful images and stunned by the panda's reflection in the water. Now, however, more and more people are starting to use it: professional creators can unleash their creativity, and ordinary individuals can showcase their ideas.

The Sora team provided two examples. The first is a short film called "Air Head." Unlike traditional content-creation techniques such as special effects and editing, Sora gives creators a cool new way to add a fresh dimension to storytelling and sharing experiences.

The second example is a multi-camera scene at the New York Zoo that Bill created with Sora. For someone who enjoys generating creative content but lacks the skills to bring it to life, models like Sora make it easy to create eye-catching works.

Bill achieved something he loved through prompts and iterations, all in less than an hour.

"I had a great time."

From short films to world-building models

The journey from shorts to epics is not just the evolution of the film industry but also the future of Sora.

Just as we've seen Pixar evolve over 30 years, more and more people will use video generation models to create an increasing number of films. At the same time, Tim believes that people will find entirely new ways to use models, which will be completely different from the current media we are accustomed to.

For example, as mentioned above, with the world model, creators work in a very different paradigm, simulating what they want users to see. People can interact with the content, leading to unexpected outcomes.

Another area in urgent need of world models is robotics. Bill argues that robots can learn a great deal from the virtual worlds constructed by such models, something no other approach can match.

Return to the Tokyo scene once more and notice how the legs move and make precise physical contact with the ground.

The knowledge about the physical world that models learn from training on raw video can be transferred to robots and other fields at low cost.

Space-time patches and new architectures

More computational power, stronger performance.

Sora builds on OpenAI's research on the DALL·E models (diffusion models) and the GPT models (Transformers).

A diffusion model generates data by starting from random noise and repeatedly removing noise until the final result emerges. The Transformer provides powerful learning capabilities and scalability. With more computation and more training data, Sora's abilities will continue to grow stronger. The team's experimental results have demonstrated a positive correlation between model performance and computational power, and they firmly believe this trend will continue.
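The "start from noise, repeatedly remove noise" loop can be illustrated with a toy sketch. This is not Sora's sampler: the noise schedule, the function names, and above all the `oracle_noise_prediction` helper are assumptions made for illustration. In a real diffusion model a trained neural network predicts the noise; here an oracle that already knows the clean target stands in for it, so that only the iterative denoising update (a deterministic, DDIM-style step) is on display.

```python
import math
import random


def make_alpha_bars(num_steps):
    """Hypothetical linear schedule: 1.0 at step 0 (clean), about 0.001 at the last step (almost pure noise)."""
    return [1.0 - t / num_steps * 0.999 for t in range(num_steps + 1)]


def oracle_noise_prediction(x_t, alpha_bar, target):
    """Stand-in for a trained network: computes the noise exactly because it knows the target."""
    noise_scale = math.sqrt(1.0 - alpha_bar)
    signal_scale = math.sqrt(alpha_bar)
    return [(x - signal_scale * c) / noise_scale for x, c in zip(x_t, target)]


def denoise(target, num_steps=50, seed=0):
    """Start from Gaussian noise and remove noise step by step."""
    rng = random.Random(seed)
    alpha_bars = make_alpha_bars(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # pure noise
    for t in range(num_steps, 0, -1):
        a_t, a_prev = alpha_bars[t], alpha_bars[t - 1]
        eps = oracle_noise_prediction(x, a_t, target)
        # Estimate the clean sample implied by the predicted noise.
        x0_hat = [(xi - math.sqrt(1.0 - a_t) * e) / math.sqrt(a_t)
                  for xi, e in zip(x, eps)]
        # Re-noise the estimate to the (lower) noise level of the previous step.
        x = [math.sqrt(a_prev) * x0 + math.sqrt(1.0 - a_prev) * e
             for x0, e in zip(x0_hat, eps)]
    return x


clean_signal = [0.5, -1.0, 2.0]  # hypothetical "data" the process should recover
sample = denoise(clean_signal)
print(sample)
```

Because the oracle is exact, the loop lands on the clean signal up to floating-point error. With a learned denoiser, the same loop would instead land on a new sample from the distribution the network was trained on.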

One of the benefits of using Transformers is that the model inherits all the strengths they have already shown in other domains, such as language.

By analogy, for video data one also needs to construct corresponding loss functions and find ways to achieve better losses without increasing the required computational resources. This is also the direction the team is working hard on.

The secret of generating long videos

One key factor in the success of large language model paradigms is the concept of tokens.

The internet is filled with various types of text data, including books, code, and mathematics. Large Language Models (LLMs) unify them into tokens, enabling training on such a wide variety of data.
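As a rough illustration of how very different kinds of text end up in one shared representation, the sketch below maps prose, code, and math to integer sequences drawn from the same vocabulary. It uses raw UTF-8 bytes as tokens, which is a deliberate simplification: production LLMs typically use learned subword vocabularies, and the function names and sample strings here are hypothetical.

```python
def to_tokens(text):
    """Encode text as a list of byte-level tokens (integers in 0..255)."""
    return list(text.encode("utf-8"))


def from_tokens(tokens):
    """Decode byte-level tokens back into text."""
    return bytes(tokens).decode("utf-8")


samples = {
    "prose": "Winter in Tokyo, a crowd gathers.",
    "code": "for i in range(3): print(i)",
    "math": "e^(i*pi) + 1 = 0",
}

# All three kinds of text become sequences over the same 256-symbol vocabulary.
token_streams = {kind: to_tokens(text) for kind, text in samples.items()}
for kind, tokens in token_streams.items():
    print(kind, len(tokens), tokens[:8])
```

Once everything is a sequence of integers from one vocabulary, a single model can be trained on all of it without caring where each sequence came from.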

Previous visual generation models did not grasp this concept. Before Sora, people typically used 256 × 256 resolution images or 256 × 256 videos for training, which limited the length of video generation and constrained the information the model could access.

In Sora, the team introduced the concept of spacetime patches: images and videos of any size are broken down into small individual patches.

These patches are the visual model's equivalent of tokens.

As a result, Sora is a general-purpose model. It can generate not only 720p videos of a fixed duration but also vertical videos, widescreen videos, and images.
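The idea of spacetime patches can be sketched as follows. This is a minimal illustration, not the Sora implementation: the `patchify` function, the patch size of 2 frames by 2 by 2 pixels, and the tiny video shapes are all assumptions, and the sketch cuts raw nested lists rather than whatever internal representation the real model uses.

```python
def patchify(video, pt, ph, pw):
    """Split video[t][y][x] into flattened spacetime patches of pt frames x ph rows x pw columns."""
    frames, height, width = len(video), len(video[0]), len(video[0][0])
    if frames % pt or height % ph or width % pw:
        raise ValueError("video dimensions must be divisible by the patch size")
    patches = []
    for t0 in range(0, frames, pt):
        for y0 in range(0, height, ph):
            for x0 in range(0, width, pw):
                patches.append([video[t][y][x]
                                for t in range(t0, t0 + pt)
                                for y in range(y0, y0 + ph)
                                for x in range(x0, x0 + pw)])
    return patches


def make_video(frames, height, width):
    """Hypothetical dummy video whose 'pixels' are just their own (t, y, x) coordinates."""
    return [[[(t, y, x) for x in range(width)] for y in range(height)]
            for t in range(frames)]


widescreen = patchify(make_video(4, 4, 8), 2, 2, 2)
vertical = patchify(make_video(4, 8, 4), 2, 2, 2)
short_clip = patchify(make_video(2, 4, 4), 2, 2, 2)

# Different shapes and durations yield different numbers of patches,
# but every patch has the same length and can be treated as one token.
for name, patches in [("widescreen", widescreen), ("vertical", vertical), ("short clip", short_clip)]:
    print(name, len(patches), len(patches[0]))
```

The point of the design is that the sequence length varies with the input while the token format stays fixed, so one model can handle many resolutions, aspect ratios, and durations.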

Starting from scratch

Before Sora, many had been extending image generation models to eventually create a few seconds of video.

But let's set a small goal first: How do we make a one-minute HD video?

To achieve this goal, we need to abandon traditional methods and start from scratch. Data needs to be broken down in a very simple way, and the model needs to be scalable. Thus, the Sora architecture was born.

"This is the first visual content generation model with the breadth of a language model."

Creating a Sora that everyone can use

Values

Safety is indeed a rather complex topic.

For instance, how should models handle harmful content such as misinformation? Should users be allowed to generate images containing offensive words?

How much responsibility should companies deploying this technology bear? How much effort should social media companies put into showing users the credibility of content? How accountable should users be for what they create?

We need to ponder these questions seriously, ensuring alignment with human values without stifling future creativity.

Democratization

Currently, generating videos is very resource-intensive, and users may have to wait several minutes to get their results.

In the future, this technology should benefit everyone, and the team is working towards this goal.

Of course, in the process of democratization, we must also be very cautious of misinformation and any surrounding risks.

From Approximate World Models to High-Fidelity Predictions

Sora was not explicitly trained on 3D information; it learned spatial relationships from a vast number of videos.

Sora is learning about our human world, yet it may be closer to reality than we are.

The way humans think about things is flawed; in fact, we cannot make very accurate long-term predictions.

As a world model, Sora will provide this capability and may one day be smarter than humans.

Give it more computing power and data, and it will improve.

And as the scale increases, the best way to learn scalable intelligence is to predict data, just as LLMs do.

Sora's scaling law is far from complete, or rather, it has just begun.

"This is an exciting moment, and we look forward to the capabilities of future models."

Reference:

https://twitter.com/saranormous/status/1783505771097112703
