
OpenAI shows Sora, an AI that creates realistic videos of anything

by admin

OpenAI has created a new and striking generative video model called Sora, which can take a short text description and turn it into a detailed, high-definition video clip up to a minute long.

Based on four sample videos that OpenAI shared with MIT Technology Review ahead of today’s announcement, the San Francisco-based company has pushed the boundaries of what’s possible with text-to-video generation (a hot new line of research that we flagged as one of the 10 Emerging Technologies for 2024).

“We believe that creating models that can understand video and all the complex interactions in our world is an important step for the AI systems of the future,” says Tim Brooks, a scientist at OpenAI.

But there is a caveat. OpenAI gave us a preview of Sora (which means “sky” in Japanese) under conditions of strict secrecy. In an unusual move, the company would only share information about Sora if we agreed to wait until after news of the model was made public to seek input from outside experts. [OpenAI has not published a technical report or demonstrated that the model actually works. And it says it will not be releasing Sora any time soon.]

PROMPT: The animated scene features a close-up of a short, fluffy monster kneeling next to a melting red candle. The art style is 3D and realistic, with special attention to lighting and texture. The mood of the painting is one of wonder and curiosity, as the monster stares at the flame with wide eyes and open mouth. His pose and expression convey a sense of innocence and joy, as if he were exploring the world around him for the first time. The use of warm colors and dramatic lighting further enhances the cozy atmosphere of the image. (Credit: OpenAI)

PROMPT: A gorgeously rendered paper world of a coral reef, teeming with colorful fish and sea creatures (Credit: OpenAI)

The first generative models that could produce video from snippets of text appeared in late 2022. But those early examples from Meta, Google, and a startup called Runway were glitchy and low-definition. Since then, the technology has improved rapidly. Runway’s Gen-2 model, released last year, can produce short clips whose quality rivals big-studio animation. But most of these examples are still only a few seconds long.

The sample videos of Sora that OpenAI shared are high-definition and full of detail. OpenAI also says it can generate videos up to a minute long. One video of a Tokyo street scene shows that Sora has learned how objects fit together in 3D: the camera swoops into the scene to follow a couple as they walk past a row of shops.


OpenAI also claims that Sora handles occlusion well. One problem with existing models is that they can fail to keep track of objects when they drop out of view. For example, if a truck passes in front of a street sign, the sign may not reappear afterward.

In a video of a papercraft underwater scene, Sora has added what look like cuts between different pieces of footage, and the model has maintained a consistent style between them.

It is not perfect. In the Tokyo video, the cars to the left look smaller than the people walking beside them. They also pop in and out between the tree branches. “There’s a lot left to do in terms of consistency over time. For example, if someone disappears from view for a long time, they don’t come back. The model forgets that they should be there,” says Brooks.

Tech tease

As impressive as they are, the sample videos shown here were no doubt selected to show Sora at its best. Without more information, it is hard to know how representative they are of the model’s typical output.

It may take us a while to find out. OpenAI’s announcement of Sora is a tech tease, and the company says it has no current plans to release it to the public. Instead, OpenAI will today begin sharing the model with third-party safety testers for the first time.

Specifically, the company is concerned about potential misuse of fake but photorealistic videos. “We are being careful with the rollout and making sure we have all the bases covered before putting this in the hands of the general public,” explains Aditya Ramesh, a scientist at OpenAI who created the company’s text-to-image model DALL-E.

But OpenAI does plan to launch the product at some point in the future. In addition to safety testers, the company is also sharing the model with a select group of video makers and artists to get feedback on how to make Sora as useful as possible for creative professionals. “The other goal is to show everyone what’s on the horizon and give a preview of what these models will be capable of,” Ramesh says.


To create Sora, the team adapted technology from DALL-E 3, the latest version of OpenAI’s flagship text-to-image model. Like most text-to-image models, DALL-E 3 uses what is known as a diffusion model. These models are trained to convert a tangle of random pixels into an image.
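OpenAI has not released Sora’s training code, but the basic trick behind a diffusion model, learning to undo noise that has been deliberately mixed into an image, can be sketched in a few lines. The snippet below is a minimal, illustrative DDPM-style training step; the linear noise schedule and the `model(noisy_images, t)` interface are assumptions made for the sake of the example, not anything OpenAI has described.

```python
# Minimal sketch of a diffusion training step (illustrative, not OpenAI's code).
import torch

def diffusion_training_step(model, clean_images, num_steps=1000):
    """Teach the model to predict the noise that was mixed into an image."""
    batch = clean_images.shape[0]
    # Pick a random noise level (timestep) for each image in the batch.
    t = torch.randint(0, num_steps, (batch,))
    # A crude linear noise schedule; real models use carefully tuned ones.
    alpha_bar = 1.0 - (t.float() + 1) / num_steps
    alpha_bar = alpha_bar.view(batch, 1, 1, 1)
    noise = torch.randn_like(clean_images)
    # Corrupt the images: a weighted blend of clean pixels and random noise.
    noisy_images = alpha_bar.sqrt() * clean_images + (1 - alpha_bar).sqrt() * noise
    # The model learns to recover the noise, i.e. to turn a tangle of random
    # pixels back into a coherent image when the process is run in reverse.
    predicted_noise = model(noisy_images, t)
    return torch.nn.functional.mse_loss(predicted_noise, noise)
```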

Sora takes this approach and applies it to videos rather than still images. But the researchers also added another technique to the mix. Unlike DALL-E or most other generative video models, Sora combines its diffusion model with a type of neural network called a transformer.

Transformers are great at processing long sequences of data, like words. That has made them the special sauce inside large language models such as GPT-4 from OpenAI and Gemini from Google DeepMind. But videos are not made of words. Instead, the researchers had to find a way to cut videos into chunks that could be treated as if they were. The approach they came up with was to dice videos up across both space and time. “It’s like you have a stack of all the video frames and you cut little cubes out of it,” Brooks says.

The transformer inside Sora can then process these chunks of video data in much the same way that the transformer inside a large language model processes words in a block of text. The researchers say this let them train Sora on many more types of video than other text-to-video models, including different resolutions, durations, aspect ratios, and orientations. “This really helped the model, and we don’t know of any other work that has done this,” Brooks notes.
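OpenAI has not published details of how Sora tokenizes video, but the “little cubes” Brooks describes can be sketched roughly as below. The patch sizes, tensor layout, and clip dimensions are illustrative assumptions, not Sora’s actual parameters.

```python
# Rough sketch of cutting a video into spacetime patches ("cubes"); sizes assumed.
import numpy as np

def video_to_spacetime_patches(video, t_patch=4, h_patch=16, w_patch=16):
    """Split a video of shape (frames, height, width, channels) into a
    sequence of flattened spacetime patches a transformer can process."""
    frames, height, width, channels = video.shape
    # Trim so each dimension divides evenly into patches.
    video = video[: frames - frames % t_patch,
                  : height - height % h_patch,
                  : width - width % w_patch]
    f, h, w, c = video.shape
    # Carve the stack of frames into little cubes of pixels.
    cubes = video.reshape(f // t_patch, t_patch,
                          h // h_patch, h_patch,
                          w // w_patch, w_patch, c)
    cubes = cubes.transpose(0, 2, 4, 1, 3, 5, 6)
    # Each cube becomes one "token": a flat vector, analogous to a word.
    return cubes.reshape(-1, t_patch * h_patch * w_patch * c)

# Example: a 64-frame, 256x256 RGB clip becomes 4,096 tokens of 3,072 values each.
clip = np.random.rand(64, 256, 256, 3).astype(np.float32)
print(video_to_spacetime_patches(clip).shape)  # (4096, 3072)
```

Treating each cube as a token is what lets the same transformer machinery that handles sequences of words handle video, regardless of a clip’s resolution, length, or aspect ratio.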

PROMPT: several giant woolly mammoths approach walking across a snowy meadow, their long woolly fur moving slightly in the wind as they walk, snow-covered trees and spectacular snow-capped mountains in the distance, mid-afternoon light with wispy clouds and a sun high in the distance creates a warm glow, the low camera view is impressive and captures the large furry mammal with beautiful photography and depth of field (Credit: OpenAI)

PROMPT: The beautiful, snowy city of Tokyo is very lively. The camera moves through the bustling city streets, following various people enjoying the beautiful snowy weather and shopping at nearby stalls. Beautiful sakura petals fly in the wind along with snowflakes (Credit: OpenAI)


“From a technical perspective it looks like very significant progress,” says Sam Gregory, executive director of Witness, a human rights organization that specializes in the use and misuse of video technology. “But the coin has two sides. The expressive capabilities offer the potential for many more people to become storytellers using video. And there are also real possibilities for misuse,” he says.

OpenAI is well aware of the risks that come with a generative video model. We are already seeing the large-scale misuse of deepfake images. Photorealistic video takes this to another level.

Gregory points out that this technology could be used to misinform people about conflict zones or protests. The range of styles is also interesting, he says. If you could generate shaky footage that looks as if it were shot on a phone, it would come across as even more authentic.

The technology isn’t there yet, but generative video went from zero to Sora in just 18 months. “We’re going to enter a universe where there will be fully synthetic content, human-generated content, and a mix of both,” Gregory says.

The OpenAI team plans to build on the safety testing it conducted last year for DALL-E 3. Sora already includes a filter that runs on all prompts sent to the model and will block requests for violent, sexual, or hateful images, as well as images of well-known people. Another filter will inspect the frames of generated videos and block material that violates OpenAI’s safety policies.

OpenAI says it is also adapting a fake-image detector developed for DALL-E 3 for use with Sora. And the company will embed industry-standard C2PA tags, metadata indicating how an image was generated, in all of Sora’s output. But these measures are far from foolproof. Fake-image detectors are hit-and-miss. Metadata is easy to remove, and most social networks strip it from uploaded images by default.

“We will definitely need to get more feedback and learn more about the risks that need to be addressed around videos before it makes sense to launch this,” Ramesh explains.

Brooks agrees. “One of the reasons we’re publishing this research now is so that we can start getting the feedback we need to figure out how to deploy it safely,” he says.
