GitHub - QwenLM/Qwen3-ASR: Qwen3-ASR is an open-source series of ASR models developed by the Qwen team at Alibaba Cloud, supporting stable multilingual speech/music/song recognition, language detection and timestamp prediction.
Key Points
- Qwen3-ASR introduces a new family of all-in-one speech recognition models (0.6B and 1.7B versions) supporting language identification and ASR for 52 languages and dialects.
- The release also includes Qwen3-ForcedAligner-0.6B, a novel non-autoregressive model capable of precise text-speech alignment and timestamp prediction in 11 languages.
- Evaluations show Qwen3-ASR-1.7B achieves state-of-the-art performance among open-source ASR models and is competitive with proprietary commercial APIs, with comprehensive inference toolkits and vLLM integration for efficient deployment.
The Qwen3-ASR project introduces a family of powerful speech recognition models, including Qwen3-ASR-1.7B and Qwen3-ASR-0.6B, alongside a novel non-autoregressive speech forced-alignment model, Qwen3-ForcedAligner-0.6B. The primary objective is to provide an all-in-one solution for language identification and Automatic Speech Recognition (ASR) across 52 languages and dialects, and precise text-speech alignment.
The core methodology behind the Qwen3-ASR models is to leverage the strong audio understanding capabilities of their foundation model, Qwen3-Omni, trained on large-scale speech data. The models support both language identification and ASR for 30 distinct languages (e.g., Chinese, English, Arabic, German, French, Spanish, Japanese, Korean) and 22 Chinese dialects (e.g., Anhui, Dongbei, Cantonese, Wu, Minnan), as well as various English accents. They are designed for high-quality, robust recognition in complex acoustic environments and on challenging text patterns, support both offline and streaming inference modes, and can transcribe long audio. Qwen3-ASR-1.7B is positioned as state-of-the-art among open-source ASR models, while the 0.6B version offers an accuracy-efficiency trade-off, achieving up to 2000× throughput at a concurrency of 128. During evaluation, models were typically decoded with greedy search.
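The offline/streaming distinction above can be illustrated with a minimal chunking sketch. This is not the models' actual streaming pipeline; the `chunk_audio` helper, chunk length, and overlap are illustrative assumptions showing how long audio might be split into overlapping windows for incremental transcription:

```python
# Illustrative sketch only: splitting a long waveform into overlapping chunks
# for streaming-style inference. Chunk and overlap sizes are assumptions,
# not parameters of the Qwen3-ASR models.

def chunk_audio(samples, sample_rate=16000, chunk_s=30.0, overlap_s=2.0):
    """Yield (start_sample, chunk) pairs covering the full input."""
    chunk_len = int(chunk_s * sample_rate)
    hop = chunk_len - int(overlap_s * sample_rate)
    if hop <= 0:
        raise ValueError("overlap must be shorter than the chunk")
    for start in range(0, max(len(samples) - 1, 1), hop):
        yield start, samples[start:start + chunk_len]
        if start + chunk_len >= len(samples):
            break

# A 70-second "waveform" at 16 kHz is covered by three 30 s chunks
# whose neighbours overlap by 2 s.
dummy = [0.0] * (70 * 16000)
chunks = list(chunk_audio(dummy))
```

In a real streaming setup the overlapping regions would be merged (e.g., by keeping the transcript of the non-overlapped portion of each chunk), which is why an overlap is used at all.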
The Qwen3-ForcedAligner-0.6B is a non-autoregressive model specifically designed for text-speech alignment. It provides timestamp prediction for arbitrary units (e.g., words, characters) within speech segments up to 5 minutes long. This model supports 11 languages (Chinese, English, Cantonese, French, German, Italian, Japanese, Korean, Portuguese, Russian, Spanish). Evaluations demonstrate its superior timestamp accuracy compared to End-to-End (E2E) based forced-alignment models, evidenced by significantly lower Average Alignment Score (AAS) values (e.g., an average AAS of 42.9 ms on MFA-Labeled Raw datasets compared to 129.8 ms for NFA and 133.2 ms for WhisperX).
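To make the millisecond-scale AAS numbers above concrete, here is a sketch of an alignment-error metric in the same spirit, taken here to be the mean absolute deviation between predicted and reference word boundaries in milliseconds. The exact AAS definition used by the project may differ, so treat this as an assumption:

```python
# Hedged sketch of an alignment-error metric in the spirit of AAS:
# mean absolute deviation (ms) between predicted and reference word
# boundaries. The project's exact AAS formula may differ.

def mean_boundary_error_ms(predicted, reference):
    """Each argument: list of (start_s, end_s) word intervals in seconds."""
    if len(predicted) != len(reference):
        raise ValueError("alignments must cover the same word sequence")
    deviations = []
    for (ps, pe), (rs, re) in zip(predicted, reference):
        deviations.append(abs(ps - rs))  # start-boundary error
        deviations.append(abs(pe - re))  # end-boundary error
    return 1000.0 * sum(deviations) / len(deviations)

pred = [(0.00, 0.42), (0.45, 0.90)]
ref = [(0.02, 0.40), (0.45, 0.95)]
err = mean_boundary_error_ms(pred, ref)  # 22.5 ms for this toy pair
```

Under a metric of this shape, the reported gap (42.9 ms vs. roughly 130 ms for NFA and WhisperX) means boundaries land about three times closer to the reference on average.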
The project provides a comprehensive inference toolkit. It supports a transformers backend for general usage and a vLLM backend for optimized, faster inference, including streaming capabilities and efficient batch processing. For enhanced performance, particularly with long inputs and large batch sizes, the use of FlashAttention 2 is recommended. This optimization reduces GPU memory usage and accelerates inference speed when models are loaded in torch.float16 or torch.bfloat16. The system is deployable via Python packages, Docker containers, vLLM servers (supporting OpenAI-compatible APIs), and official DashScope APIs.
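The vLLM deployment path described above might look like the following shell sketch. The model ID, image tag, and transcription endpoint are assumptions; consult the repository's own deployment instructions for the exact commands:

```shell
# Hedged deployment sketch -- model ID, image tag, and endpoint are assumptions.

# Option 1: serve with vLLM's OpenAI-compatible server
vllm serve Qwen/Qwen3-ASR-1.7B --port 8000

# Option 2: the same server inside the official vLLM Docker image
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
    --model Qwen/Qwen3-ASR-1.7B

# Sanity-check the OpenAI-compatible API (ASR models may additionally
# expose an audio endpoint such as /v1/audio/transcriptions)
curl http://localhost:8000/v1/models
```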
Performance is quantitatively measured using Word Error Rate (WER) for ASR and Average Alignment Score (AAS) for forced alignment. The evaluation results showcase the models' competitiveness across various public and internal benchmarks, including:
- ASR Benchmarks: WER scores on datasets like Librispeech, GigaSpeech, CommonVoice, MLS, Tedlium, WenetSpeech, AISHELL-2, SpeechIO, Fleurs, KeSpeech, and internal accented English and Chinese Mandarin/Dialect datasets. For instance, Qwen3-ASR-1.7B achieved a WER of 1.63% on Librispeech clean, 4.97% on WenetSpeech net (Chinese), and 5.10% on KeSpeech (Chinese dialect), often outperforming Whisper-large-v3 and Fun-ASR, and proving competitive with or superior to proprietary APIs such as GPT-4o and Gemini-2.5-Pro on various tasks.
- Multilingual ASR: Strong performance across diverse language sets on MLS, CommonVoice, MLC-SLM, and Fleurs benchmarks.
- Language Identification: High accuracy rates, averaging 97.9% for Qwen3-ASR-1.7B across multiple language sets.
- Singing Voice & Song Transcription: Demonstrated robustness in transcribing singing voices and songs with background music on datasets like M4Singer and EntireSongs-en/zh.
- Inference Mode Performance: Comparing offline and streaming inference shows only minor degradation in streaming mode, indicating that a single model serves both modes well.
- Forced Alignment Benchmarks: Qwen3-ForcedAligner-0.6B significantly outperforms other aligners like Monotonic-Aligner, NFA, and WhisperX in terms of AAS.
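The WER figures quoted throughout these benchmarks follow the standard edit-distance definition: substitutions, insertions, and deletions at the word level, divided by the number of reference words. A minimal reference computation looks like:

```python
# Word Error Rate: Levenshtein distance over word sequences, divided by
# the number of reference words (the standard definition behind the
# benchmark numbers above).

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the"):
# 2 errors over 6 reference words.
score = wer("the cat sat on the mat", "the cat sit on mat")
```

In practice, published WER numbers also depend on text normalization (casing, punctuation, number formatting) applied before scoring, which is why toolkits ship their own normalizers.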