How EMO works, Alibaba’s AI that makes any photo speak and sing

by admin March 1, 2024

A new artificial intelligence model, called EMO, makes it possible to animate static images, such as photos or works of art, in a surprising way.

Whether it’s a person, an illustration or a face in a work of art, EMO can generate facial expressions and natural head movements based on the audio, spoken or sung, that is fed to the AI.

On social media, users have shared the first fruits of this new generative AI:

To the left, in the clip above, is *a single* frame from one of the first videos generated by Sora, OpenAI’s new text-to-video artificial intelligence. You may remember it: it is the one in which a woman walks down a street in Tokyo, with the metropolis brightly lit by neon signs.

On the right, that frame has been animated by another generative AI, developed by researchers at Alibaba, the Chinese e-commerce giant. The model is called EMO and produces expressive portrait videos from an audio track. The limitation (for now) is that the faces and gazes animated by the AI retain the position and orientation they have in the reference frames.

In practice, Alibaba researchers say, “by providing a single reference image and audio – spoken or sung content – the model is able to animate the people portrayed with accurate facial expressions and head movements.”

The result is extraordinary. Sora’s “woman in red” “performs” with perfect lip sync, moves her eyebrows and her head, and more generally adapts her facial expressions to the intonation, pauses and even breathing of the piece that has been “assigned” to her: Don’t Start Now by Dua Lipa.

This happens because EMO, in short, recognizes the sound wave and generates individual video frames that reflect it. “This allows it to capture the subtle movements and individual quirks associated with natural speech.”

Judging by the first released demos, EMO also does well with plain speech. One example that impresses with its fluidity and realism is Joaquin Phoenix’s monologue as the Joker, generated from a single still from the 2019 film about the Gotham City villain.

I almost believed it!!
It almost felt like the original Joker from The Dark Knight movie

This is my favorite video from Alibaba’s Audio2Video lipsync model.

Look closely and you’ll see why pic.twitter.com/B97pizLsSg

— Halim Alrasihi (@HalimAlrasihi) February 28, 2024

“Our model supports songs in various languages and brings to life different portrait styles [from photos to drawings to works of art: it even made the Mona Lisa speak]. And it intuitively recognizes pitch variations in audio,” write the Alibaba researchers in their paper, published on arXiv, a platform for sharing scientific studies in various disciplines that have not yet undergone formal peer review, i.e. evaluation by other experts in the relevant research field.
