
MIT Tech Review: Video generation revolution, OpenAI announces new model “Sora”


OpenAI has developed a new video generation model called “Sora” (from the Japanese word for “sky”). The model can generate detailed, high-resolution video clips up to a minute long from a short text description.

The four sample videos that OpenAI shared with MIT Technology Review ahead of the announcement are meant to showcase the potential of its text-to-video generation technology, a new research direction this magazine flagged as a trend to watch in 2024, and to show how far the field has advanced.

“Building models that can understand video and all of the highly complex interactions that exist in our world is the future of all artificial intelligence (AI) systems,” says OpenAI scientist Tim Brooks. “I think this is an important step toward achieving that.”

However, there is a caveat. OpenAI gave this magazine a preview of Sora on the condition that it not seek outside expert opinions until the news was made public. OpenAI has not published a technical report, and we have not verified whether the model actually works as claimed. The company also has no plans to release Sora anytime soon.

Prompt: Animated scene. A close-up of a short, fluffy monster kneeling beside a melting red candle. The art style is realistic 3D, with an emphasis on lighting and texture. The monster gazes at the flame with wide eyes and an open mouth, conveying surprise and curiosity. Its pose and expression suggest an innocent, playful attitude, as if it were exploring the world around it for the first time. Warm colors and dramatic lighting further enhance the cozy atmosphere of the image.
Provided by: OpenAI
Prompt: A gorgeously rendered papercraft world of a coral reef, full of colorful fish and sea creatures.
Provided by: OpenAI

The first generative models that could produce video from snippets of text appeared in late 2022. But those early samples from Meta, Google, and a startup called Runway had a number of flaws, and the footage was grainy. The technology has advanced rapidly since then. Runway’s Gen-2 model, released last year, can produce short clips with quality comparable to big-studio animation. But most samples are still only a few seconds long.

OpenAI’s sample videos for Sora are high-resolution and rich in detail, and the company says the model can generate videos up to a minute long. A video of a Tokyo skyline shows that Sora has learned how objects fit together in three dimensions: the camera swoops down into the scene and follows a couple as they walk past a row of shops.


OpenAI claims that Sora is also good at handling occlusion (the way objects in the foreground hide objects behind them). A problem with existing models is that they can fail to keep track of objects once they drop out of view. For example, if a truck passes in front of a street sign, the sign may not reappear afterward.

And in the papercraft underwater scene, Sora inserted what look like cuts between different pieces of footage while maintaining a consistent style throughout the video.

The videos aren’t perfect. In the Tokyo clip, the cars on the left look smaller than the people walking beside them, and they pop in and out between the tree branches. “There are certainly issues that still need to be fixed in terms of long-term consistency,” Brooks says. “For example, if someone goes out of view for a long time, they never come back. The model forgets that they were supposed to be there.”

“Technical preview”

The sample videos shown here are certainly impressive, but they were no doubt hand-picked to show Sora at its best. Without more information, it is hard to know how representative they are of the model’s typical output.

It may be a while before we find out. OpenAI describes this Sora announcement as a “technical preview” and currently has no plans to release the model to the public. Instead, the company plans to share Sora with third-party safety testers for the first time.

OpenAI is particularly concerned about the potential for photorealistic fake video to be misused. Aditya Ramesh, an OpenAI scientist and one of the developers of the company’s text-to-image model DALL-E, says that before making Sora publicly available, the company is paying close attention to how it is deployed and making sure all of the necessary infrastructure is in place.

Even so, OpenAI has its sights set on commercializing the model in the future. In addition to safety testers, the company is opening Sora up to a select group of video makers and artists to get feedback on how to make it as useful as possible for creative professionals. “Another goal is to give people a glimpse of the future of these models and a preview of what they will be capable of,” Ramesh says.


To build Sora, the team adapted the technology behind DALL-E 3, OpenAI’s flagship text-to-image generation model. Like most text-to-image models, DALL-E 3 uses a diffusion model, which is trained to turn a fuzz of random pixels into an image.
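To make the idea of “turning pixel fuzz into images” concrete, here is a minimal, simplified sketch of a single diffusion training step in PyTorch. It illustrates the general technique only, not OpenAI’s implementation: the tiny convolutional denoiser, the linear noise mix, and every size and parameter below are placeholder assumptions.

```python
# Minimal sketch of the diffusion training objective: the model learns to
# predict the noise mixed into an image, so that at generation time it can
# start from pure noise and remove it step by step.
import torch
import torch.nn as nn

denoiser = nn.Sequential(                    # stand-in for a real denoising network
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1),
)
optimizer = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

images = torch.rand(8, 3, 64, 64)            # fake batch of training images
noise = torch.randn_like(images)             # the "pixel fuzz"
t = torch.rand(8, 1, 1, 1)                   # how much noise to mix in (0..1)
noisy = (1 - t) * images + t * noise         # corrupted inputs

pred_noise = denoiser(noisy)                 # the model guesses the noise it sees
loss = nn.functional.mse_loss(pred_noise, noise)
loss.backward()
optimizer.step()
print(f"denoising loss: {loss.item():.4f}")
```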

Sora applies this technique to video rather than still images. But the researchers added another technique as well: unlike DALL-E and most other video generation models, Sora combines its diffusion model with a type of neural network called a transformer.

Transformers are good at processing long sequences of data, such as words, which is why they are the core technology inside large language models like OpenAI’s GPT-4 and Google DeepMind’s Gemini. But videos are not made of words. To treat video like words, the researchers had to find a way to break it up into “chunks.” The approach they came up with was to dice videos up across both space and time. “It’s like stacking all the video frames and cutting little cubes out of the stack,” Brooks says.
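The “little cubes” idea can be illustrated with a few lines of array slicing. The sketch below dices a dummy video into spacetime patches; the cube sizes and video dimensions are assumptions made for illustration, not details OpenAI has published.

```python
# Cut a video into spacetime "cubes": stack the frames, then carve out small
# blocks across both space and time, flattening each block into one vector.
import numpy as np

frames, height, width, channels = 16, 64, 64, 3
video = np.random.rand(frames, height, width, channels)    # stand-in for real footage

t_size, h_size, w_size = 4, 16, 16                          # cube dimensions (time, height, width)
cubes = []
for t in range(0, frames, t_size):
    for y in range(0, height, h_size):
        for x in range(0, width, w_size):
            cube = video[t:t + t_size, y:y + h_size, x:x + w_size, :]
            cubes.append(cube.reshape(-1))                  # flatten each cube into one vector

patches = np.stack(cubes)                                   # one row per spacetime cube
print(patches.shape)                                        # (64, 3072): 64 "words" of 4*16*16*3 values
```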

The transformers in Sora process these video “chunks” in much the same way that the transformers in large language models process words in a block of text. According to the researchers, this let them train Sora on far more kinds of video than other text-to-video models, spanning different resolutions, durations, aspect ratios, and orientations. “That kind of training really helps Sora,” Brooks says. “It’s something we’re not aware of in any existing work.”
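Continuing the sketch above, each spacetime cube can be projected into an embedding and handed to a transformer as a token, much as a language model treats words in a block of text. Again, this is a generic illustration with assumed sizes, not Sora’s actual architecture.

```python
# Feed spacetime cubes to a transformer encoder as a sequence of tokens.
import torch
import torch.nn as nn

num_cubes, cube_values, d_model = 64, 3072, 256
cubes = torch.rand(1, num_cubes, cube_values)          # (batch, tokens, raw cube values)

embed = nn.Linear(cube_values, d_model)                # project each cube to a token embedding
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

tokens = embed(cubes)                                  # (1, 64, 256): one token per spacetime cube
out = encoder(tokens)                                  # tokens attend to each other across space and time
print(out.shape)                                       # torch.Size([1, 64, 256])
```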

Prompt: Several giant woolly mammoths approach, treading through a snowy meadow. Their long woolly fur blows lightly in the wind as they walk. Snow-covered trees and majestic snow-capped mountains stand in the distance under late-afternoon light with a thin layer of clouds, and the sun, high in the sky, casts a warm glow over the scene. The low camera angle captures the large furry mammals with beautiful imagery and depth of field.
Provided by: OpenAI
Prompt: The beautiful, snow-covered city of Tokyo is bustling with people. The camera moves through a busy city street, following several people enjoying the beautiful snowy scenery and shopping at nearby stores. Gorgeous cherry blossom petals dance in the wind along with snowflakes.
Provided by: OpenAI


OpenAI is well aware of the risks that come with a video generation model. We are already seeing large-scale abuse of deepfake images, and photorealistic video takes those risks to another level.

The development team plans to draw on the safety testing it carried out for DALL-E 3 last year. Sora already comes with a filter that runs on every prompt sent to the model, blocking requests for violent, sexual, or hateful imagery, as well as requests for images of known people. Another filter examines the frames of generated videos and blocks material that violates OpenAI’s safety policies.
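As a purely hypothetical illustration of that two-stage setup (a check on the incoming prompt, then a check on the generated frames), a sketch might look like the following. The blocklist, the classify_frame stub, and the threshold are invented for illustration; the article does not describe how OpenAI’s filters work internally.

```python
# Hypothetical two-stage content filter: screen the text prompt first,
# then screen each frame of the generated video.
from typing import Iterable

BLOCKED_TERMS = {"violent", "sexual", "hateful"}      # placeholder categories, not a real policy list

def prompt_allowed(prompt: str) -> bool:
    """First stage: reject prompts that ask for disallowed content."""
    words = prompt.lower().split()
    return not any(term in words for term in BLOCKED_TERMS)

def classify_frame(frame) -> float:
    """Second stage stub: return a policy-violation score for one frame (0 = safe)."""
    return 0.0   # a real system would run an image classifier here

def video_allowed(frames: Iterable, threshold: float = 0.5) -> bool:
    return all(classify_frame(f) < threshold for f in frames)

if prompt_allowed("a papercraft coral reef with colorful fish"):
    fake_frames = [object()] * 8                      # stand-in for generated frames
    print("video passes frame check:", video_allowed(fake_frames))
```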

According to OpenAI, the fake-image detector developed for DALL-E 3 has been adapted for use with Sora as well. Moreover, industry-standard C2PA tags (metadata that describes how an image was generated) will be embedded in all of Sora’s output. But such measures are far from foolproof: the accuracy of fake-image detection is uneven, and metadata such as C2PA tags can easily be removed. Most social media platforms strip metadata from uploaded images by default.

“We need to get more feedback and learn more about the types of risks that need to be addressed with video before it makes sense to release Sora,” Ramesh says.

Brooks agrees. “One of the reasons we’re talking about this research now is so that we can start gathering the input we need to work out how to deploy it safely.”

Will Douglas Heaven, US Senior Editor for AI. He covers new research, emerging trends, and the people behind them. Previously, he was founding editor of Future Now, the BBC’s technology and politics website, and technology editor at New Scientist magazine. He holds a PhD in computer science from Imperial College London, with a background in robot control.
