mlx-community/chatterbox-turbo-fp16
Key Points
- The `mlx-community/chatterbox-turbo-fp16` model is a text-to-speech system converted to the MLX format from ResembleAI's Chatterbox Turbo.
- It offers voice cloning, allowing users to generate speech in a custom voice by providing a reference audio file.
- The model also supports expressive emotion control through specific event tags such as [chuckle], [sigh], or [groan] inserted directly into the input text.
`mlx-community/chatterbox-turbo-fp16` is a Text-to-Speech (TTS) model converted to the MLX format from the `ResembleAI/chatterbox-turbo` model. The conversion was performed with version 0.2.8 of the mlx-audio library, making the model compatible with Apple's MLX framework for efficient on-device inference. The weights are stored in fp16 (half-precision floating-point) in safetensors format.
The core methodology and capabilities of this model, as described, revolve around flexible and expressive speech synthesis:
- Standard Text-to-Speech Generation: The model's primary function is to convert input text into synthesized speech. This is demonstrated by invoking the `mlx_audio.tts.generate` command with a specified model and text input.
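A minimal sketch of that invocation, assuming the standard `pip` package name and that the CLI accepts `--model` and `--text` flags (only the module path and the general shape of the call come from the text above; the example sentence is a placeholder):

```shell
# Install the MLX audio toolkit (version 0.2.8 was used for this conversion)
pip install mlx-audio

# Plain text-to-speech with the fp16 Chatterbox Turbo conversion
python -m mlx_audio.tts.generate \
  --model mlx-community/chatterbox-turbo-fp16 \
  --text "Hello from Chatterbox Turbo running on MLX."
```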
- Voice Cloning (Zero-Shot Speaker Adaptation): A key feature is voice cloning, achieved by providing a `ref_audio` (reference audio) file during generation. The model analyzes the vocal characteristics of the reference (e.g., timbre, pitch, speaking style) and attempts to synthesize the input text in that cloned voice; the details of this mechanism are not described, but it suggests the model extracts speaker embeddings or otherwise conditions the synthesis on the reference. The command-line interface uses `--ref_audio path_to_file.wav` to enable this functionality.
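Sketching the cloning variant of the same command: `--ref_audio` is the flag named above, while the model/text flags and the WAV path are assumptions and placeholders.

```shell
# Voice cloning: condition generation on a short reference recording.
# path_to_file.wav is a placeholder for your own reference audio.
python -m mlx_audio.tts.generate \
  --model mlx-community/chatterbox-turbo-fp16 \
  --text "This sentence should sound like the reference speaker." \
  --ref_audio path_to_file.wav
```

In practice, a few seconds of clean, single-speaker audio is the typical input for zero-shot cloning systems of this kind.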
- Emotion and Expressive Control via Event Tags: The model supports explicit control over vocal expressions and non-verbal cues through a system of "expressive event tags" inserted directly into the input text. These tags allow users to programmatically introduce natural vocal events and emotional nuances into the synthesized speech. Examples of supported tags and their corresponding descriptions include:
  - [clear throat]: generates a throat-clearing sound
  - [sigh]: synthesizes a sighing expression
  - [shush]: produces a shushing sound
  - [cough]: creates a coughing sound
  - [groan]: adds a groaning expression
  - [sniff]: generates a sniffing sound
  - [gasp]: synthesizes a gasping expression
  - [chuckle]: produces a light chuckling sound
  - [laugh]: generates laughter
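As an illustration of the tag syntax, here is a small hypothetical helper (not part of mlx-audio) that validates and splices these event tags into input text before it is handed to the generator:

```python
# Hypothetical helper -- not part of mlx-audio -- for building input text
# that uses the Chatterbox Turbo expressive event-tag syntax.

# Event tags documented for this model; each maps to a non-verbal vocal event.
SUPPORTED_TAGS = {
    "clear throat", "sigh", "shush", "cough", "groan",
    "sniff", "gasp", "chuckle", "laugh",
}

def tag(name: str) -> str:
    """Return the bracketed event tag, validating it against the known set."""
    if name not in SUPPORTED_TAGS:
        raise ValueError(f"unknown event tag: {name!r}")
    return f"[{name}]"

# Build an input string with expressive events embedded directly in the text.
text = f"Well {tag('chuckle')} I did not expect that. {tag('sigh')} Let's move on."
print(text)
```

The resulting string can then be passed as the text input to the generation command; the model renders the bracketed events as the corresponding vocal sounds.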
The model is primarily intended for use within the MLX ecosystem, requiring the mlx-audio library for its operation. It is tagged as supporting English (en) and is licensed under Apache-2.0.