
How EMO works, Alibaba’s AI that makes any photo speak and sing

by admin

A new artificial intelligence model, called EMO, can animate static images – such as photos or works of art – in a surprising way.

Whether it’s a person, an illustration or a face in a work of art, EMO is capable of generating natural facial expressions and head movements based on the audio – spoken or sung – that is fed to it.

On social media, users have shared the first fruits of this new generative AI:

On the left, in the clip above, is *a single* frame from one of the first videos generated by Sora, OpenAI’s new text-to-video artificial intelligence. You may remember it: it’s the one in which a woman walks down a street in Tokyo, the metropolis brightly lit by neon signs.

On the right, that frame has been animated by another generative AI, developed by researchers at Alibaba, the Chinese e-commerce giant. The model is called EMO and produces expressive videos starting from a photo and an audio track. The limitation (for now) is that the faces and gazes animated by the AI retain the position and orientation they have in the reference frame.

In practice, the Alibaba researchers explain, “by providing a single reference image and audio – spoken or sung content – the model is able to animate the people portrayed with accurate facial expressions and head movements.”

The result is extraordinary. Sora’s “woman in red” “performs” with perfect lip-sync, moves her eyebrows and head, and more generally adapts her facial expressions to the intonation, pauses and even breathing of the piece that has been “assigned” to her: Don’t Start Now by Dua Lipa.

This happens because EMO, in short, recognizes the sound wave and generates individual video frames that reflect it. “This allows it to capture the subtle movements and individual quirks associated with natural speech.”
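The idea of mapping an audio wave to per-frame animation can be illustrated with a toy sketch. The snippet below is purely hypothetical and does not reflect EMO’s actual architecture (which is a learned generative model producing full video frames): it splits a waveform into one window per video frame, computes each window’s loudness, and maps that to a simple “mouth openness” value per frame.

```python
import math

def frame_audio_features(samples, sample_rate=16000, fps=25):
    """Split a mono waveform into one window per video frame and
    compute RMS energy for each window (a crude stand-in for the
    learned audio encoder a model like EMO would use)."""
    hop = sample_rate // fps  # audio samples per video frame
    feats = []
    for start in range(0, len(samples) - hop + 1, hop):
        window = samples[start:start + hop]
        feats.append(math.sqrt(sum(s * s for s in window) / len(window)))
    return feats

def animate(features):
    """Map each per-frame audio feature to a 'mouth openness' value
    in [0, 1] (the real model would instead emit a full image frame)."""
    peak = max(features) or 1.0
    return [f / peak for f in features]

# One second of a 220 Hz tone whose volume rises over time.
sr = 16000
samples = [(i / sr) * math.sin(2 * math.pi * 220 * i / sr) for i in range(sr)]
openness = animate(frame_audio_features(samples, sample_rate=sr, fps=25))
print(len(openness))  # one value per generated video frame
```

One second of audio at 25 fps yields 25 frames, and the rising volume produces a steadily opening mouth – a caricature of how the sound wave drives the generated frames.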


EMO also performs well with simple speech, judging by the first demos released. It renders, for example, Joaquin Phoenix’s monologue as Joker with impressive fluidity and realism, starting from a single still from the 2019 film about the Gotham City villain.

“Our model supports songs in various languages and brings to life different portrait styles [from photos to drawings to works of art: it even made the Mona Lisa speak]. And it intuitively recognizes pitch variations in the audio,” the Alibaba researchers write in their paper published on arXiv, a platform for sharing scientific studies across disciplines that have not yet undergone formal peer review, i.e., evaluation by other experts in the relevant research field.
