GitHub - Blaizzy/mlx-audio: A text-to-speech (TTS), speech-to-text (STT) and speech-to-speech (STS) library built on Apple's MLX framework, providing efficient speech analysis on Apple Silicon.
Key Points
- MLX-Audio is a library built on Apple's MLX framework, designed for fast and efficient text-to-speech (TTS), speech-to-text (STT), and speech-to-speech (STS) on Apple Silicon.
- It supports a variety of multilingual models for each task, offering features such as voice customization, voice cloning, and speech enhancement, with quantization options for performance.
- The library provides a command-line interface, a Python API, an interactive web interface, an OpenAI-compatible REST API, and tools for model conversion and quantization.
MLX-Audio is an audio processing library built on Apple's MLX framework, providing optimized Text-to-Speech (TTS), Speech-to-Text (STT), and Speech-to-Speech (STS) capabilities on Apple Silicon (M-series chips). Its core approach is to harness MLX's performance benefits for accelerated on-device inference.
The library functions as an abstraction layer over various state-of-the-art deep learning models, providing a unified interface for speech synthesis, recognition, and transformation. Its key architectural choice, building directly on MLX, allows for efficient memory utilization and computation on Apple's unified memory architecture, leading to significantly faster processing times compared to general-purpose frameworks on this hardware.
Core Methodologies and Technical Details:
- Optimized Inference via MLX Framework: The fundamental technical aspect is the direct integration with Apple's MLX framework. MLX is designed to maximize throughput and minimize latency on Apple Silicon by optimizing operations for its unique hardware characteristics, including the CPU, GPU, and Neural Engine. MLX-Audio converts and runs pre-trained models (originally from frameworks like Hugging Face Transformers) into MLX-compatible formats, enabling these models to execute directly on the optimized MLX runtime. This translates to faster model loading, lower memory footprint, and high inference speeds for all integrated functionalities.
- Model Architectures and Functionalities:
- Text-to-Speech (TTS): MLX-Audio supports diverse TTS models, each employing different techniques:
- Kokoro: A fast, high-quality multilingual TTS model. While its specific architecture isn't detailed, such models typically comprise a text-to-phoneme converter, an acoustic model (e.g., Tacotron, FastSpeech), and a vocoder (e.g., WaveNet, HiFi-GAN) that synthesizes a waveform from acoustic features. It offers predefined voice presets and speed control.
- Qwen3-TTS: This model offers advanced voice design capabilities through multiple variants:
- *Base Model:* Generates speech with predefined voices.
- *CustomVoice Model:* Extends the base model by allowing emotion control (`instruct`) during synthesis, likely achieved by conditioning the acoustic model or vocoder on emotion embeddings derived from the `instruct` parameter.
- *VoiceDesign Model:* Enables the creation of novel voices from textual descriptions (`instruct`), implying an underlying voice embedding network that can synthesize speaker characteristics from natural-language prompts, which then guide the speech generation process. This likely leverages latent-space manipulation or sophisticated conditioning mechanisms.
- CSM (Conversational Speech Model): Focuses on voice cloning. This typically involves an encoder that extracts a speaker embedding from a reference audio file (`ref_audio.wav`). The speaker embedding is then used as a conditioning input to the TTS model (acoustic model and vocoder) to synthesize speech in the target speaker's voice, enabling zero-shot or few-shot voice cloning.
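The conditioning idea behind voice cloning can be sketched in a few lines. Everything below is a toy illustration with made-up function names, not mlx-audio's actual API: a "speaker encoder" summarizes a reference waveform into a fixed-size embedding, and the synthesis step is conditioned on that embedding.

```python
# Toy sketch of speaker-embedding conditioning for voice cloning.
# All names here are hypothetical; mlx-audio's real classes and models differ.

def speaker_encoder(ref_audio: list[float], dim: int = 4) -> list[float]:
    """Toy 'encoder': summarize a reference waveform into a fixed-size embedding
    by averaging consecutive chunks (real encoders are learned networks)."""
    chunk = max(1, len(ref_audio) // dim)
    means = [sum(ref_audio[i:i + chunk]) / chunk
             for i in range(0, len(ref_audio), chunk)]
    return means[:dim]

def synthesize(text_features: list[float], speaker_emb: list[float]) -> list[float]:
    """Toy 'acoustic model': bias every text feature by the speaker embedding,
    so the same text yields different output for different reference speakers."""
    bias = sum(speaker_emb) / len(speaker_emb)
    return [t + bias for t in text_features]

ref = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]   # stand-in for ref_audio.wav samples
emb = speaker_encoder(ref)                         # one embedding per reference speaker
audio = synthesize([1.0, 2.0], emb)                # speech conditioned on that speaker
```

The key design point survives the simplification: the reference audio influences synthesis only through a compact embedding, which is what makes zero-shot cloning from a single `ref_audio.wav` possible.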
- Speech-to-Text (STT):
- Whisper: Implements OpenAI's robust STT model, often based on a Transformer architecture that processes audio directly (or via mel-spectrograms) and outputs text tokens.
- VibeVoice-ASR (Microsoft): A large-parameter ASR model that includes advanced features like speaker diarization and timestamping.
- *Speaker Diarization:* Identifies "who spoke when." This typically involves clustering voice activity detection (VAD) segments based on speaker embeddings, often using techniques like x-vectors or d-vectors combined with clustering algorithms (e.g., K-means, Agglomerative Clustering). The output includes speaker IDs for each segment.
- *Timestamping:* Provides start and end times for words or phrases. This is usually achieved through forced alignment or by training the ASR model to predict timestamps directly alongside text tokens.
- *Context (Hotwords):* Allows boosting recognition of specific terms, which can be implemented by biasing the decoding process (e.g., through prefix scoring or a custom language model) towards the provided keywords.
- *Streaming Transcription:* Processes audio segments incrementally, outputting text tokens as they are generated. This requires a streaming-capable ASR architecture, often employing a causal decoder or a chunk-based processing approach.
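The clustering step of speaker diarization described above can be sketched with a greedy threshold-based grouping of (toy) speaker embeddings. This is a conceptual illustration, not VibeVoice-ASR's method: real systems use learned x-vectors/d-vectors and proper agglomerative or K-means clustering.

```python
# Minimal sketch of diarization's clustering step: segments whose (toy) speaker
# embeddings are similar enough get the same speaker ID.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def assign_speakers(embeddings: list[list[float]], threshold: float = 0.9) -> list[int]:
    """Greedy clustering: attach each segment to the first cluster whose
    representative embedding is within the similarity threshold."""
    reps: list[list[float]] = []   # one representative embedding per speaker
    labels: list[int] = []
    for emb in embeddings:
        for spk, rep in enumerate(reps):
            if cosine(emb, rep) >= threshold:
                labels.append(spk)
                break
        else:
            labels.append(len(reps))  # unseen voice: open a new cluster
            reps.append(emb)
    return labels

segments = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.98, 0.1]]
print(assign_speakers(segments))  # → [0, 0, 1, 0]: segments 0, 1, 3 share a speaker
```

Paired with per-segment start/end times from VAD, these labels yield the "who spoke when" output.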
- Speech-to-Speech (STS):
- SAM-Audio (Source Separation): Utilizes text-guided source separation, a novel approach in which a text description (e.g., "A person speaking") acts as a query to separate a target sound from a mixed audio input. Methodologically, this might involve cross-modal attention mechanisms where text embeddings guide the attention of an audio processing network (e.g., a Transformer-based U-Net) to focus on relevant sound components and suppress others. The output consists of a target audio stream and a residual stream. The `separate_long` method indicates chunking and overlap-add processing for long audio files to manage memory and computation.
- MossFormer2 SE (Speech Enhancement): Designed for noise removal. This typically employs neural network architectures (e.g., U-Nets, Conv-TasNet variants) that learn to map noisy speech waveforms or spectrograms to clean speech. The "SE" (Speech Enhancement) designation implies a focus on improving speech quality by suppressing background noise, often through spectral estimation or masking techniques.
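The chunked overlap-add processing presumably behind a `separate_long`-style method can be sketched as follows. The window and hop sizes are made up, and the separation model is replaced by an identity stand-in; the point is how overlapping windows are accumulated and averaged so long audio never has to fit in memory at once.

```python
# Hedged sketch of chunked overlap-add processing for long audio: run the model
# on fixed-size overlapping windows, accumulate the results, and normalize by
# how many windows covered each sample. Sizes are illustrative only.

def process(chunk: list[float]) -> list[float]:
    # Stand-in for the actual separation model; identity here so the
    # reconstruction can be checked exactly.
    return list(chunk)

def overlap_add(samples: list[float], win: int = 8, hop: int = 4) -> list[float]:
    out = [0.0] * len(samples)
    weight = [0.0] * len(samples)
    for start in range(0, len(samples), hop):
        chunk = samples[start:start + win]
        if not chunk:
            break
        for i, v in enumerate(process(chunk)):
            out[start + i] += v        # accumulate overlapping windows
            weight[start + i] += 1.0   # count coverage for averaging
    return [o / w for o, w in zip(out, weight)]

audio = [float(i) for i in range(16)]
restored = overlap_add(audio)  # with an identity model, input is reconstructed
```

Real implementations typically apply a tapered cross-fade at chunk boundaries rather than plain averaging, to avoid audible seams.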
- Quantization: MLX-Audio supports quantization (3-bit, 4-bit, 6-bit, 8-bit) via a `convert` script.
- Technical Process: This involves reducing the precision of model weights and activations from higher-precision floating-point formats (e.g., `float32`, `bfloat16`) to lower-bit integer representations. The `--q-bits` parameter specifies the target bit width.
- Group Size (`--q-group-size`): Quantization can be applied per tensor or per group. Group-wise quantization applies the same scaling factor and zero-point to a small group of weights (e.g., 64 weights), which helps maintain accuracy compared to per-tensor quantization while still offering significant compression and speedup.
- Benefits: Reduces model size, decreases memory bandwidth requirements, and can improve inference speed on hardware supporting low-precision arithmetic, which is especially beneficial on resource-constrained devices or for larger models.
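Group-wise affine quantization can be illustrated in pure Python. This is a conceptual sketch of the idea behind `--q-bits` and `--q-group-size`, not MLX's actual kernel: each group gets its own scale and zero-point, so the quantization step adapts to that group's value range.

```python
# Illustrative group-wise affine quantization: per-group scale and zero-point,
# 4-bit integers. A conceptual sketch, not MLX's implementation.

def quantize_group(group: list[float], bits: int = 4):
    lo, hi = min(group), max(group)
    levels = (1 << bits) - 1                    # 15 integer levels for 4-bit
    scale = (hi - lo) / levels or 1.0           # avoid zero scale for flat groups
    q = [round((w - lo) / scale) for w in group]  # integers in [0, levels]
    return q, scale, lo                         # lo serves as the zero-point

def dequantize_group(q: list[int], scale: float, zero: float) -> list[float]:
    return [v * scale + zero for v in q]

# Two groups with very different ranges: per-group scales keep both accurate.
weights = [0.10, -0.32, 0.05, 0.48, 1.90, 2.05, 1.75, 2.20]
group_size = 4
recon = []
for i in range(0, len(weights), group_size):
    q, s, z = quantize_group(weights[i:i + group_size])
    recon.extend(dequantize_group(q, s, z))

max_err = max(abs(a - b) for a, b in zip(weights, recon))
# The round-trip error stays within half a quantization step of each group.
```

A single per-tensor scale would have to span the full range of all eight weights, giving a coarser step and larger error for the small-magnitude group; this is precisely the accuracy benefit of group-wise quantization described above.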
- OpenAI-Compatible API: The library provides an OpenAI-compatible REST API, allowing developers to integrate its functionalities using familiar endpoint structures (`/v1/audio/speech`, `/v1/audio/transcriptions`). This facilitates interoperability and eases adoption for developers already familiar with OpenAI's API.
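A client request against such an endpoint might look like the sketch below. The `/v1/audio/speech` path comes from the text above; the JSON field names (`model`, `input`, `voice`) mirror OpenAI's speech API, which the server presumably accepts, and the host, port, and voice name are assumptions for illustration.

```python
# Sketch of an OpenAI-style TTS request using only the standard library.
# Host/port and the model/voice values are illustrative assumptions.
import json
from urllib import request

def build_speech_request(base_url: str, model: str, text: str, voice: str) -> request.Request:
    payload = json.dumps({"model": model, "input": text, "voice": voice}).encode()
    return request.Request(
        base_url + "/v1/audio/speech",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_speech_request("http://localhost:8000", "kokoro", "Hello from MLX.", "af_heart")
# request.urlopen(req) would return the synthesized audio bytes from the server.
```

Because the request shape matches OpenAI's, existing OpenAI client code can usually be pointed at the local server by changing only the base URL.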
- Multilingual Support: Models like Kokoro, Qwen3-TTS, and Whisper offer extensive multilingual capabilities, handled by training on diverse language datasets and using language-specific tokens or conditioning mechanisms within the model architectures.
In summary, MLX-Audio is a comprehensive audio AI library distinguished by its deep integration with Apple's MLX framework for performance, its broad support for various state-of-the-art TTS, STT, and STS models, and its technical features such as advanced voice design, speaker diarization, text-guided source separation, and efficient quantization for deployment.