
Video is the next frontier of generative AI

by admin

The hyper-realistic images generated by artificial intelligence models such as Stable Diffusion, DALL-E or Midjourney have now reached very high levels of realism.
In recent weeks, the “photos” of the Pope in a stylish white puffer jacket and a series of fake shots of Donald Trump’s violent arrest, which never took place, have been taken as genuine by a worrying number of people. Meanwhile, an image, likewise generated with one of these models, won an award at the prestigious Sony World Photography Awards.

While the debate rages about the consequences of these technologies for the perception of truth, for journalism and, more generally, for the media and for society, researchers are already raising the bar. The next milestone to be reached is realism for text2video models, that is, the artificial intelligences capable of generating entire videos starting from a simple text prompt.



Will Smith eats spaghetti

The AI-generated videos seen so far are nowhere near comparable, in quality and ability to deceive, to static images.
We are still in a preliminary phase, in which the results obtained with the models available to date are laughable rather than worrying.
This is the case, for example, of a video that has been circulating for some time on Reddit and on social networks in which Will Smith is seen eating spaghetti. Or rather, we see a distorted and nightmarish version of the famous actor performing gestures that, with a little imagination, resemble the action described in the prompt.


The video is a collection of short two-second clips generated with an AI tool called ModelScope, a model released a few weeks ago by the Damo Vision Intelligence Lab, a research group within the Chinese giant Alibaba. The model was trained on millions of images and thousands of videos drawn from a series of datasets used by many researchers in the field. The source of the training data, which also includes preview images taken from the stock photography site Shutterstock, explains the presence of the platform’s watermark in the final result.
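For readers who want to try it, the model’s weights are also available on Hugging Face and can be driven through the diffusers library. What follows is a minimal sketch, not an official recipe: it assumes the commonly used “damo-vilab/text-to-video-ms-1.7b” checkpoint, a CUDA-capable GPU, and illustrative parameter values.

import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

# Load the ModelScope text2video checkpoint in half precision so it fits on a
# consumer GPU (assumption: the "damo-vilab/text-to-video-ms-1.7b" weights).
pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe = pipe.to("cuda")

# The infamous prompt: a handful of frames yields a clip of roughly two
# seconds, in line with the snippets described above. Depending on the
# diffusers version, the output may need to be read as .frames[0].
frames = pipe("Will Smith eating spaghetti", num_frames=16).frames
path = export_to_video(frames, output_video_path="will_smith_spaghetti.mp4")
print(f"Clip saved to {path}")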

LDM videos from Nvidia

One of the most promising developments in text2video, however, comes these days from Nvidia, one of the key companies in the AI sector. The research division of the American company, in collaboration with Cornell University, has developed a new model for video creation with artificial intelligence. It applies a latent diffusion model (hence the acronym LDM) to video generation, the same approach used by Stable Diffusion, the well-known open-source image generation software.

Starting from a text description of the desired result, the model can create clips with a resolution of up to 2048 by 1280 pixels at a frame rate of 24 frames per second. The clips produced by the AI model currently have a maximum duration of 4.7 seconds. Some examples can be seen on the mini-site set up by Nvidia to illustrate the research.

By adjusting the parameters, the model can also create videos that look as if they were filmed from the dashboard of a moving car. In this case the clips have a lower resolution of 1024 by 512 pixels, but can last up to five minutes. The applications Nvidia is exploring here are not just about filmmaking, but also about multimedia elements in video games.

Greatly simplified, the system devised by Nvidia is a sort of sequential image generator whose frames are then “aligned” to create coherent movement. The advantage of this approach is that the computing power required to generate a video is low and can be optimized further. As always in these cases, ample resources are needed to train the model, but videos can then be created even on machines that are not particularly powerful. The paper describing the model is publicly available, while the software remains confined to Nvidia’s laboratories for now.
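The general idea (not Nvidia’s actual code, which has not been released) can be pictured as follows: per-frame latents are produced by an image model and then passed through temporal layers that mix information between neighbouring frames so that motion becomes coherent, before a decoder turns the latents into images. The sketch below is an illustrative approximation in PyTorch; all names and sizes are made up.

import torch
import torch.nn as nn

class TemporalAlignment(nn.Module):
    """Stand-in for the temporal layers that tie independently generated
    frame latents together so the resulting motion looks coherent."""
    def __init__(self, channels: int):
        super().__init__()
        # A 3D convolution over the frame axis mixes information between
        # neighbouring frames while leaving spatial resolution untouched.
        self.mix = nn.Conv3d(channels, channels,
                             kernel_size=(3, 1, 1), padding=(1, 0, 0))

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        # latents: (batch, channels, frames, height, width)
        return latents + self.mix(latents)

# Illustrative latent-space sizes: 16 frames of 4-channel latents.
frame_latents = torch.randn(1, 4, 16, 40, 64)   # as if produced per frame by an image model
aligned = TemporalAlignment(4)(frame_latents)   # temporally smoothed latents
print(aligned.shape)  # same shape; a decoder would then turn these into frames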



Adobe Firefly

While the Video LDM model belongs to the realm of applied research, Adobe is already thinking about the practical applications of generative models in creative work.
During NAB in Las Vegas, an important trade show for multimedia production technologies, the company announced a series of updates to its Firefly platform aimed at video professionals. Adobe’s solutions do not go in the direction of creating videos from scratch; they are, rather, practical applications of artificial intelligence to video editing and production.

In a teaser video of the features it is working on, the company illustrated some of the possible developments. The idea is that they will later be made available within Premiere Pro and After Effects, as has already happened in the past with similar announcements whose features can now be used within the programs of the Creative Cloud suite.
Thanks to AI it will be possible, for example, to perform quick automatic color correction on videos. Just type a command like “golden hour” or “cold light”, and the software will apply the requested effect to all the clips in a project.

Another possible scenario: imagine you have a video interview and you need some overlays, that is, pieces of footage to superimpose on the subject’s speech. Here Adobe’s AI will spare editors the search, identifying among the clips of a project those that best match what the interviewee is talking about. The “engine” in this case is the automatic transcription system that Adobe has already made available within Premiere Pro.

With a similar system, but applied to audio, Adobe promises another interesting development: the possibility of automatically generating sounds suited to the video content. Imagine having a clip of a wave breaking on the shoreline: Adobe’s AI will be able to generate the sound of the sea based on an analysis of the video content. Not only that: the audio clip generated this way will always be original, so there will be no need to obtain rights to use the sound.

Another future application that screenwriters will love is the ability to create 3D or 2D pre-visualizations that can help directors and camera operators shoot a scene better. This practice, widespread in cinema, will thus come within reach of even small productions, which usually cannot afford the necessary costs and resources. Finally, the application of AI to the generation of text effects is also interesting: in this case, a simple command is enough to create an animation that would otherwise require a few hours of work in After Effects.
