Whisper — Robust Speech Recognition via Large-Scale Weak Supervision

Whisper is a general-purpose speech recognition model trained on a large and diverse collection of audio. It is a multitask model that handles multilingual speech recognition, speech translation, and language identification.

Isaac Kargar

Note: Generative AI services were used as assistants in writing this blog post.

Introduction

Pre-trained audio encoders learn high-quality representations of speech, but because they are trained without supervision they lack an equally capable decoder. A complex fine-tuning stage is therefore needed before they are useful for tasks like speech recognition. Fine-tuning can improve performance, but it can also lead a model to pick up brittle, spurious patterns that do not generalize to other datasets. A model that performs exceptionally well on one dataset may still make basic errors on another because of these dataset-specific quirks.

Unsupervised pre-training has enhanced the quality of audio encoders significantly, but the lack of an equivalent pre-trained decoder and the need for dataset-specific fine-tuning are key weaknesses. Ideally, a speech recognition system should function reliably across various environments without needing supervised fine-tuning for every deployment.

Speech recognition systems that are pre-trained in a supervised manner across multiple datasets are more robust and generalize better than those trained on a single dataset. However, the available high-quality speech recognition datasets are limited in size. Newer efforts aim to create larger datasets, often trading quality for quantity by relying on weak supervision. Recent work in computer vision has demonstrated that moving beyond gold-standard crowdsourced datasets such as ImageNet to much larger but weakly supervised datasets significantly improves the robustness and generalization of models.

This work introduces Whisper, which scales weakly supervised speech recognition to 680,000 hours of labeled audio data, removing the need for any dataset-specific fine-tuning. The approach is not limited to English: it also includes multilingual and multitask training, with 117,000 hours covering 96 other languages and 125,000 hours of…

Isaac Kargar

Co-Founder and CIO @ Resoniks | Ph.D. candidate at the Intelligent Robotics Group at Aalto University | https://kargarisaac.github.io/