
Google announces nuclear-bomb-grade AI API: Universal Speech Model (USM)

by admin

After announcing the plan last November, Google has now published its Universal Speech Model (USM) API along with research results. The 2-billion-parameter model was trained on 12 million hours of speech and 28 billion sentences of text, covering more than 300 languages, and can currently perform automatic speech recognition in more than 100 of them. In terms of supported languages and training-data size, it is undoubtedly a nuclear-bomb-level model (and it is still under development, with the ultimate goal of supporting 1,000 languages). Here are a few highlights of USM:

The Self-Supervised Learning Trilogy

At present, the biggest challenge in automatic speech recognition (ASR) is that traditional supervised learning lacks scalability: collecting labeled data is time-consuming and laborious. To expand both language coverage and recognition quality, the model needs to be trained in a more efficient way.

Google’s approach is “continuous self-supervised learning and fine-tuning.” The first step uses the self-supervised learning method BEST-RQ, which learns from a large amount of unlabeled speech data without any external supervision (this step alone accounts for 80% of the training workload); readers can picture this stage as the machine supervising and teaching itself, without relying on humans at all.
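What does “the machine supervising itself” look like in practice? Below is a minimal NumPy sketch of the random-projection quantizer idea behind BEST-RQ; the sizes and names here are illustrative assumptions, not Google’s actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only.
FEATURE_DIM = 80      # e.g. log-mel filterbank features per frame
CODE_DIM = 16         # dimension of the random projection space
CODEBOOK_SIZE = 4096  # number of discrete target labels

# BEST-RQ keeps both the projection and the codebook frozen at
# random initialization; neither is ever trained.
projection = rng.normal(size=(FEATURE_DIM, CODE_DIM))
codebook = rng.normal(size=(CODEBOOK_SIZE, CODE_DIM))

def quantize(frames: np.ndarray) -> np.ndarray:
    """Map each speech frame to the index of its nearest random code.

    These indices become the prediction targets for masked frames,
    so no human transcription is needed.
    """
    projected = frames @ projection                      # (T, CODE_DIM)
    # Normalize, then pick the nearest codebook entry per frame.
    projected /= np.linalg.norm(projected, axis=-1, keepdims=True)
    codes = codebook / np.linalg.norm(codebook, axis=-1, keepdims=True)
    distances = np.linalg.norm(projected[:, None] - codes[None], axis=-1)
    return distances.argmin(axis=-1)                     # (T,) target IDs

# Example: 100 frames of (fake) audio features -> 100 discrete targets.
targets = quantize(rng.normal(size=(100, FEATURE_DIM)))
print(targets[:10])
```

Because the projection and codebook stay frozen, producing targets is nearly free, which is what lets this stage scale to millions of hours of unlabeled audio.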

Photo Credit: Google

The second step is multi-objective supervised pre-training, which integrates knowledge from additional data, chiefly through text injection, combining the BEST-RQ objective with a supervised loss. The third step fine-tunes on the downstream tasks using only the supervised loss. Google says the output of the first two stages is already so good that the third stage accounts for only 5% of the workload, yet the overall model still reaches very high quality.
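In rough outline, the three stages differ mainly in which loss terms are active. The sketch below is heavily simplified for illustration: the stand-in scalar losses and weights are hypothetical, and in reality each term comes from a full forward pass of the model.

```python
# Illustrative only: the real objectives are computed by model forward
# passes; here they are stand-in scalars, and the weights are hypothetical.
def multi_objective_loss(best_rq_loss: float,
                         text_injection_loss: float,
                         supervised_loss: float,
                         w_speech: float = 1.0,
                         w_text: float = 1.0,
                         w_sup: float = 1.0) -> float:
    """Stage 2 optimizes a weighted sum of the three objectives;
    stage 3 fine-tunes with only the supervised term."""
    return (w_speech * best_rq_loss
            + w_text * text_injection_loss
            + w_sup * supervised_loss)

# Stage 2: all objectives active.
print(multi_objective_loss(2.1, 1.4, 0.9))
# Stage 3: only the supervised loss drives fine-tuning.
print(multi_objective_loss(2.1, 1.4, 0.9, w_speech=0.0, w_text=0.0))
```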


So that’s why YouTube feels different

Have any readers noticed that the quality of YouTube’s real-time speech recognition and translated captions has improved? That’s right: Google has already deployed this version of USM on YouTube, achieving a word error rate (WER) of less than 30% across 73 low-resource languages (each with under 3,000 hours of training data on average).
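As a refresher, WER is the number of word-level substitutions, deletions, and insertions needed to turn the model’s transcript into the reference, divided by the reference length. A self-contained implementation for illustration:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed as word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit-distance table.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution + one deletion over 6 reference words -> WER ~ 0.333.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```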

Photo Credit: Google

USM claims victory over OpenAI’s Whisper

Of course, Google couldn’t resist comparing its work against others’. On American English, USM’s WER is 6% relatively lower than other state-of-the-art models; and against OpenAI’s Whisper (large-v2), on the 18 languages where Whisper’s WER stays below 40%, USM’s WER is on average 32.7% relatively lower. (Simply put, USM is the more accurate of the two.)
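These figures are relative reductions, which is worth a quick sanity check (the numbers below are invented purely for illustration, not taken from the paper):

```python
# Hypothetical numbers, purely to illustrate "relative" WER reduction.
whisper_wer = 0.30                    # suppose Whisper scores 30% WER
usm_wer = whisper_wer * (1 - 0.327)   # 32.7% relatively lower
print(f"USM WER: {usm_wer:.1%}")      # -> USM WER: 20.2%
```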

Google has now released the paper and is accepting applications from researchers for access to the USM API.

Review Editor: Jocelyn
