GitHub - jamiepine/voicebox: The open-source voice synthesis studio powered by Qwen3-TTS.
Key Points
- Voicebox is an open-source, local-first voice synthesis studio designed as a private, professional alternative to cloud services, offering voice cloning, speech generation, and DAW-like editing features.
- Powered by Qwen3-TTS, it enables high-fidelity voice cloning from short audio samples, multi-voice project creation with a timeline editor, and in-app recording and transcription.
- Built with Tauri and leveraging MLX for Apple Silicon, Voicebox provides native performance, a full REST API for integration, and flexible deployment options including local and remote server modes.
Voicebox is an open-source, local-first voice synthesis studio designed to provide a privacy-focused alternative to cloud-based voice AI services like ElevenLabs. It offers professional-grade tools for voice cloning, speech generation, and audio editing, all running entirely on the user's local machine without cloud dependencies.
Voicebox pairs a deep learning voice synthesis model with an application architecture built for local execution. Its primary voice cloning model is Qwen3-TTS, developed by Alibaba's Qwen team. The project highlights the model's ability to produce high-fidelity voice clones from just a few seconds of audio, preserving natural prosody, emotion, and cadence across multiple languages (initially English and Chinese).
For efficient inference, Voicebox employs a dual-backend strategy:
- On Apple Silicon (M1/M2/M3) Macs, it uses the MLX backend, which runs on the GPU through Metal acceleration, for a reported 4-5x faster inference.
- For Windows, Linux, and Intel Macs, it defaults to a PyTorch backend, with CUDA GPU acceleration recommended for optimal performance, though CPU operation is supported at a slower pace.
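The backend choice described above can be sketched as a small platform check. This is a minimal illustration, not the project's actual selection code: the function names `choose_backend` and `detect_backend` and the returned strings are hypothetical, and the only assumption about the platform is that Apple Silicon Macs report `Darwin` and `arm64` through Python's standard `platform` module.

```python
import platform


def choose_backend(system: str, machine: str) -> str:
    """Pick an inference backend name from OS and CPU architecture strings.

    Hypothetical helper: returns "mlx" for Apple Silicon Macs and
    "pytorch" for everything else (Windows, Linux, Intel Macs).
    """
    # Apple Silicon Macs report system "Darwin" and machine "arm64".
    if system == "Darwin" and machine == "arm64":
        return "mlx"
    # Intel Macs (Darwin/x86_64), Windows, and Linux fall back to PyTorch.
    return "pytorch"


def detect_backend() -> str:
    """Apply choose_backend to the machine this code is running on."""
    return choose_backend(platform.system(), platform.machine())
```

Keeping the decision in a pure function that takes the platform strings as arguments makes it easy to test on any machine. Whether the PyTorch path then runs on CUDA or on the CPU is a separate decision that this sketch leaves out.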
The application's architecture is built for native performance and flexibility:
- The desktop application is developed with Tauri (Rust), chosen over Electron for its significantly smaller bundle size, native performance characteristics, and lower memory footprint.
- The frontend is a modern web application built with React, TypeScript, and Tailwind CSS, managing state with Zustand and data fetching with React Query.
- The backend is implemented as a FastAPI (Python) server, which provides an asynchronous API and automatic OpenAPI schema generation, facilitating a type-safe end-to-end experience via a generated TypeScript client.
- Data persistence is handled by SQLite. Audio waveform visualization uses WaveSurfer.js in the frontend, while librosa handles audio processing in the Python backend.
Voicebox exposes a local REST API so that custom applications can integrate with it. The API gives programmatic control over functions such as generating speech (e.g., POST /generate with {"text": "Hello world", "profile_id": "abc123", "language": "en"}), listing voice profiles (GET /profiles), and creating new profiles (POST /profiles).
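As a rough illustration of calling the generation endpoint from Python, the sketch below builds the POST /generate request with the example payload from the text, using only the standard library. The base URL and port are placeholders (the source does not state which address the server listens on), the helper names `build_generate_request` and `send` are hypothetical, and the response is returned as raw bytes because the response format is not described here.

```python
import json
import urllib.request

# Assumption: placeholder address; replace with wherever your Voicebox server listens.
BASE_URL = "http://localhost:8000"


def build_generate_request(text, profile_id, language="en", base_url=BASE_URL):
    """Build (but do not send) a POST /generate request with a JSON body."""
    payload = {"text": text, "profile_id": profile_id, "language": language}
    return urllib.request.Request(
        url=f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60.0):
    """Send a prepared request and return the raw response body as bytes."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

With a server running, `send(build_generate_request("Hello world", "abc123"))` would submit the request. Because the backend is a FastAPI server with automatic OpenAPI schema generation, that generated schema is the authoritative place to check the exact request and response shapes.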
Key features include:
- Voice Profile Management: Creation from audio files or direct recording, import/export, multi-sample support for higher quality, and organization with descriptions and language tags.
- Speech Generation: Text-to-speech with any cloned voice, batch generation for long-form content, and smart caching for instant regeneration.
- Stories Editor: A timeline-based digital audio workstation (DAW)-like interface for creating multi-voice narratives, podcasts, and conversations. It supports multi-track composition, inline audio editing (trimming, splitting), synchronized auto-playback, and voice mixing.
- Recording & Transcription: In-app recording with waveform visualization, system audio capture (macOS/Windows), and automatic transcription.
- Generation History: A full history of generated audio with search, filter, and one-click re-generation capabilities.
- Flexible Deployment: Supports purely local operation, connecting to a GPU server on the network (remote mode), or transforming any machine into a Voicebox server.
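The source does not describe how the "smart caching" listed under Speech Generation works. One common way to make regeneration instant is to key stored audio on a hash of the generation inputs, so that an identical request is served from the cache instead of running the model again. The sketch below shows that general technique only; `cache_key`, `GenerationCache`, and the in-memory dictionary are hypothetical and are not taken from Voicebox.

```python
import hashlib
import json


def cache_key(text, profile_id, language):
    """Derive a stable key from the inputs that determine the generated audio."""
    # Sorted keys give the same serialized form for the same inputs every time.
    canonical = json.dumps(
        {"text": text, "profile_id": profile_id, "language": language},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class GenerationCache:
    """In-memory cache that runs the generator only for unseen inputs."""

    def __init__(self):
        self._store = {}

    def get_or_generate(self, text, profile_id, language, generate):
        """Return cached audio, calling generate(text, profile_id, language) on a miss."""
        key = cache_key(text, profile_id, language)
        if key not in self._store:
            self._store[key] = generate(text, profile_id, language)
        return self._store[key]
```

A real implementation would likely persist audio on disk and also account for anything else that changes the output, such as updated voice samples for a profile.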
The roadmap outlines future enhancements such as real-time synthesis, multi-speaker conversation mode with automatic turn-taking, voice effects (pitch shift, reverb), more fine-grained timeline editing, support for additional open-source models (XTTS, Bark), voice design from text descriptions, complex project saving/loading, a plugin architecture, and a mobile companion app. The project emphasizes privacy by ensuring models and voice data remain on the user's machine.