Cohere, best known for text and embedding models, just shipped a speech recognition model that hits #1 on the HuggingFace Open ASR Leaderboard: 5.42% average WER versus Whisper Large v3’s 7.44%. It’s 2B parameters, Apache 2.0, available on HuggingFace today.

The architecture is a Conformer: a hybrid of CNNs and Transformers. CNNs handle local acoustic features (phonemes, rapid transitions), Transformers handle global context. Interleaving them is the standard trick for ASR; Cohere’s bet is that a dedicated, from-scratch training run focused on WER beats the generalist approach.

For long audio, the model chunks at 35 seconds with overlap and reassembles, so you can run an hour-long recording without VRAM issues.

The limitations worth knowing: no speaker diarization, no timestamps, no automatic language detection (you specify the language upfront), and inconsistent behavior on code-switched audio. It supports 14 languages.

texts = model.transcribe(processor=processor, audio_files=[audio_file], language="en")