After announcing the plan last November, Google has now published its Universal Speech Model (USM) API along with research results. The model was trained on 12 million hours of speech and 28 billion sentences of text, covering more than 300 languages, and can currently perform automatic speech recognition for more than 100 of them. In terms of supported languages and training-data scale, it is undoubtedly a nuclear-bomb-level model (and it is still under development, with the ultimate goal of supporting 1,000 languages). Here are a few highlights of USM:
The Self-Supervised Learning Trilogy
At present, the biggest challenge in automatic speech recognition (ASR) is that the traditional supervised-learning approach does not scale: collecting labeled data is time-consuming and labor-intensive. The model needs to be trained in a more efficient way to expand both language coverage and recognition quality.
Google’s approach is “continuous self-supervised learning and fine-tuning”. In the first step, the self-supervised method BEST-RQ analyzes and learns from a large amount of unlabeled speech data without any external supervision (this step alone accounts for about 80% of the workload). Readers can picture this stage as the machine supervising and teaching itself entirely, with no reliance on human labeling.
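To make the idea concrete, here is a minimal sketch of how a BEST-RQ-style random-projection quantizer turns unlabeled speech into discrete prediction targets. The dimensions, names, and NumPy implementation are illustrative assumptions, not the configuration from Google’s paper:

```python
import numpy as np

rng = np.random.default_rng(0)

FEAT_DIM = 80        # e.g. log-mel features per frame (assumption)
PROJ_DIM = 16        # projection dimension (assumption)
CODEBOOK_SIZE = 4096 # number of discrete codes (assumption)

# Both the projection and the codebook are randomly initialized and
# FROZEN: they are never trained, which is what makes the quantizer
# work without any human supervision.
projection = rng.normal(size=(FEAT_DIM, PROJ_DIM))
codebook = rng.normal(size=(CODEBOOK_SIZE, PROJ_DIM))

def quantize(frames: np.ndarray) -> np.ndarray:
    """Map speech frames (T, FEAT_DIM) to discrete label IDs (T,)."""
    projected = frames @ projection  # (T, PROJ_DIM)
    # Nearest codebook entry per frame becomes that frame's pseudo-label.
    dists = np.linalg.norm(
        projected[:, None, :] - codebook[None, :, :], axis=-1
    )
    return dists.argmin(axis=-1)

# The encoder is then trained BERT-style: mask spans of frames and
# predict the quantized labels of the masked positions.
speech = rng.normal(size=(100, FEAT_DIM))  # 100 dummy frames
labels = quantize(speech)                  # targets for masked prediction
```

Because the projection and codebook stay frozen, no transcripts and no learned quantizer are needed; the model’s only job is to predict the code IDs of masked frames.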
The second step uses multi-objective supervised pre-training to incorporate knowledge from additional data, combining text injection, the BEST-RQ objective, and a supervised loss function. The third step fine-tunes on the downstream task using only the supervised loss. Google says the output of the first two stages is already so good that the third stage accounts for only about 5% of the workload, so a very high-quality model can be obtained overall.
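The shape of that three-stage recipe can be summarized in a few lines. This is a minimal sketch with hypothetical loss values and weights; none of these names come from Google’s code:

```python
def stage2_loss(best_rq_loss: float,
                text_injection_loss: float,
                supervised_asr_loss: float,
                w_ssl: float = 1.0,
                w_text: float = 1.0,
                w_sup: float = 1.0) -> float:
    """Stage 2 optimizes a weighted sum of all three objectives
    (weights are illustrative assumptions)."""
    return (w_ssl * best_rq_loss
            + w_text * text_injection_loss
            + w_sup * supervised_asr_loss)

def stage3_loss(supervised_task_loss: float) -> float:
    """Stage 3 fine-tunes with the supervised task loss only."""
    return supervised_task_loss

# Toy values just to show the structure of the recipe:
print(stage2_loss(2.1, 1.4, 0.9))  # 4.4
print(stage3_loss(0.9))            # 0.9
```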
So That’s Why YouTube Feels Different
Have any readers noticed that the quality of YouTube’s live speech recognition and translation has improved? That’s because Google has already deployed this version of USM on YouTube, notably achieving a word error rate (WER) below 30% across 73 lower-resource languages that each average under 3,000 hours of training data.
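For readers unfamiliar with the metric: WER is the word-level edit distance (substitutions, deletions, insertions) between the recognizer’s output and a reference transcript, divided by the number of reference words. A minimal self-contained implementation, with a made-up example sentence:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub,               # substitution (or match)
                          d[i - 1][j] + 1,   # deletion
                          d[i][j - 1] + 1)   # insertion
    return d[len(ref)][len(hyp)] / len(ref)

# "WER below 30%" means fewer than 3 word errors per 10 reference words:
print(word_error_rate("the quick brown fox jumps over the lazy dog rests",
                      "the quick brown fox jumped over lazy dog rests"))  # 0.2
```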
Claiming Victory over OpenAI’s Whisper
Of course, Google couldn’t resist comparing its model against others. On US English, USM’s WER is 6% lower (a relative reduction) than the current state-of-the-art internal model. And compared with OpenAI’s Whisper (large-v2), on the 18 languages where Whisper achieves a WER below 40%, USM’s WER is on average 32.7% lower than Whisper’s. (Put simply, USM is more accurate.)
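Note that figures like these are relative reductions, not differences in percentage points. A quick sanity check with made-up numbers:

```python
# Hypothetical illustration only; these are not measured values.
whisper_wer = 0.20                   # assume Whisper scores 20% WER on a language
usm_wer = whisper_wer * (1 - 0.327)  # 32.7% relatively lower
print(f"{usm_wer:.3f}")              # 0.135, i.e. about 13.5% WER
```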
At present, Google has released the paper and is also letting researchers apply for access to the USM API.
Review Editor: Jocelyn