Sora, OpenAI's latest AI model, is a generative AI (GenAI) system that turns text instructions into realistic and imaginative scenes. It can generate 1080p movie-like scenes with multiple characters, different types of motion, and detailed backgrounds from a brief or detailed text description, or even from a single still image.

While OpenAI's demo page for Sora features enthusiastic statements, such as the one above, the showcased samples from the model do appear notably impressive, especially when compared to other text-to-video technologies available.

In addition to creating scenes, Sora can "extend" existing video clips, attempting to fill in missing details.

In its announcement, OpenAI described the model's current availability:

"Today, Sora is becoming available to red teamers to assess critical areas for harms or risks. We are also granting access to a number of visual artists, designers, and filmmakers to gain feedback on how to advance the model to be most helpful for creative professionals.

We’re sharing our research progress early to start working with and getting feedback from people outside of OpenAI and to give the public a sense of what AI capabilities are on the horizon."

Sora is a diffusion model: generation starts from a video that looks like static noise, and the model gradually transforms it by removing the noise over many steps.
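
The idea of starting from noise and removing it step by step can be illustrated with a toy loop. This is only a minimal sketch, not Sora's actual sampler: the function name `toy_denoise`, the step schedule, and the "denoiser" are all assumptions made for illustration. In particular, the stand-in denoiser cheats by looking at the clean signal, whereas a real diffusion model uses a trained network to estimate the noise.

```python
import random

def toy_denoise(clean, steps=50, seed=0):
    """Start from pure noise and remove part of the remaining noise each step.

    `clean` stands in for what a trained network would steer toward.
    A real diffusion model never sees the clean signal and must estimate
    the noise instead; this oracle is purely illustrative.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in clean]  # initial static-like noise
    history = []
    for step in range(steps):
        remaining = steps - step
        # Hypothetical "denoiser": the noise estimate is simply x - clean.
        predicted_noise = [xi - ci for xi, ci in zip(x, clean)]
        # Remove 1/remaining of the noise, so the final step removes what is left.
        x = [xi - ni / remaining for xi, ni in zip(x, predicted_noise)]
        # Track the largest remaining deviation from the clean signal.
        history.append(max(abs(xi - ci) for xi, ci in zip(x, clean)))
    return x, history

signal = [0.0, 0.5, 1.0, 0.5]
result, history = toy_denoise(signal, steps=20)
print(history[0], history[-1])  # the error shrinks as the steps proceed
```

Each pass leaves the sample a little less noisy than before, which is the behavior the paragraph above describes, here on a four-value signal rather than a video.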

Sora can generate an entire video at once or extend an existing one to make it longer. Because the model is given foresight of many frames at a time, it can keep a subject consistent even when that subject temporarily goes out of view.

Similar to GPT models, Sora uses a transformer architecture, which OpenAI credits with superior scaling performance.

OpenAI represents videos and images as collections of smaller units of data called patches, each analogous to a token in GPT. This unified representation makes it possible to train diffusion transformers on a wider range of visual data than before, spanning different durations, resolutions, and aspect ratios.
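
To make the patch idea concrete, here is a minimal sketch of cutting a video into fixed-size blocks that span both space and time. It works on raw pixel values in nested lists for simplicity. The helper names `patchify` and `make_video` and the patch sizes are assumptions chosen for illustration, not details from OpenAI.

```python
def patchify(video, pt, ph, pw):
    """Split a video given as nested lists [T][H][W] into flattened patches.

    Each patch covers pt frames x ph rows x pw columns and plays the role
    of one "token". Dimensions are assumed divisible by the patch size.
    """
    T, H, W = len(video), len(video[0]), len(video[0][0])
    if T % pt or H % ph or W % pw:
        raise ValueError("video dimensions must be divisible by the patch size")
    patches = []
    for t0 in range(0, T, pt):
        for y0 in range(0, H, ph):
            for x0 in range(0, W, pw):
                # Flatten one pt x ph x pw block into a single vector.
                patch = [video[t][y][x]
                         for t in range(t0, t0 + pt)
                         for y in range(y0, y0 + ph)
                         for x in range(x0, x0 + pw)]
                patches.append(patch)
    return patches

def make_video(T, H, W):
    """Build a dummy single-channel video whose pixel value encodes its position."""
    return [[[t * H * W + y * W + x for x in range(W)]
             for y in range(H)]
            for t in range(T)]

wide = patchify(make_video(4, 4, 8), pt=2, ph=2, pw=2)  # 4 frames, landscape
tall = patchify(make_video(2, 8, 4), pt=2, ph=2, pw=2)  # 2 frames, portrait
print(len(wide), len(wide[0]))  # 16 8
print(len(tall), len(tall[0]))  # 8 8
```

The two clips differ in duration and aspect ratio, yet both become sequences of patches of the same size. Only the sequence length changes, which is what lets a single transformer consume both.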

Building upon prior research in DALL·E and GPT models, Sora incorporates the recaptioning technique from DALL·E 3. This involves generating highly descriptive captions for visual training data, enabling the model to more faithfully follow user text instructions in the generated video.

Beyond generating videos from text instructions alone, the model can take an existing still image and animate its contents with precision and attention to detail. Sora can also extend an existing video or fill in missing frames. OpenAI says further details will appear in a forthcoming technical paper.

According to OpenAI, Sora serves as a foundation for models that can understand and simulate the real world, which the company regards as an important step toward Artificial General Intelligence (AGI).