GitHub - DrewThomasson/ebook2audiobook: Generate audiobooks from e-books, voice cloning & 1158+ languages!
DrewThomasson
2026.02.10
GitHub · by 이호민
#Audiobook#Ebook#Python#TTS#Voice Cloning

Key Points

  • Ebook2audiobook is a comprehensive tool designed to convert various e-book formats into audiobooks, featuring OCR capabilities and support for chapters and metadata.
  • It offers high-quality text-to-speech using multiple engines such as XTTSv2 and Piper-TTS, supports voice cloning, and covers over 1158 languages.
  • The software is resource-efficient, providing both a Gradio web interface and headless modes, with flexible deployment options including local installation and Docker for various computing environments.

The ebook2audiobook project is an open-source tool for converting digital text content from various e-book formats into audiobooks, incorporating advanced text-to-speech (TTS) capabilities, including voice cloning and extensive language support. It runs on CPU, GPU (CUDA, ROCm, XPU, JETSON, MPS), or other accelerated hardware, and emphasizes accessibility with minimal hardware requirements (2 GB RAM / 1 GB VRAM minimum).
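As a quick illustration, a headless conversion could be launched roughly as follows. This is a minimal sketch, not the project's verbatim invocation: the flag names come from the documentation summarized below, but the file paths and the `xtts` engine name are placeholders, and the script only prints the command rather than executing it.

```shell
# Sketch of a headless conversion command. Paths are placeholders, and the
# app.py entry point and the engine name "xtts" are assumptions.
EBOOK="mybook.epub"        # input e-book (placeholder)
OUTDIR="./audiobooks"      # output directory (placeholder)
VOICE="narrator.wav"       # optional voice-cloning reference (placeholder)

# Print the command instead of executing it, since ebook2audiobook
# may not be installed in this environment.
echo python app.py --headless \
    --ebook "$EBOOK" \
    --output_dir "$OUTDIR" \
    --voice "$VOICE" \
    --tts_engine xtts
```

In real use the placeholders would be replaced with actual file paths, and `--ebooks_dir` could be substituted for `--ebook` to batch-convert a whole directory.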

Core Methodology and Technical Aspects:

  1. Input Processing: The system accepts a wide array of e-book formats, including .epub, .pdf, .mobi, .txt, .html, .docx, and many others. For documents containing text as images (e.g., scanned PDFs), it integrates Optical Character Recognition (OCR) to extract the textual content. The project prioritizes .epub and .mobi formats for optimal results, as they facilitate automatic chapter detection and metadata extraction, leading to a more structured audiobook output. Users are advised to manually pre-process e-books to remove unwanted text, since the EPUB format lacks a standardized structure for semantic elements such as chapters or paragraphs, which can lead to unwanted content being converted to audio.
  2. Text-to-Speech (TTS) Engine Integration: The heart of the conversion process is its ability to leverage multiple advanced TTS engines:
    • XTTSv2: A high-quality model capable of near-real voice synthesis and zero-shot voice cloning. It offers adjustable parameters such as --temperature, --length_penalty, --num_beams, --repetition_penalty, --top_k, --top_p, and --speed for fine-tuning speech generation characteristics.
    • BARK: Another advanced TTS model, with configurable --text_temp and --waveform_temp for controlling its output.
    • Piper-TTS, Vits, Fairseq, Tacotron2, YourTTS: These engines provide diverse options for speech quality, speed, and language compatibility. The TTS engine can be specified explicitly (--tts_engine) or determined automatically from the selected language. While modern TTS engines are computationally intensive on CPUs, simpler models such as YourTTS and Tacotron2 can be used for faster CPU-based conversions.
  3. Voice Cloning: A prominent feature is the ability to clone a voice from a user-provided audio file (e.g., MP3 or WAV). This reference audio (--voice <path_to_voice_file>) is used by compatible TTS engines (primarily XTTSv2) to synthesize the e-book content in the cloned voice, enabling personalized audiobooks narrated in a specific individual's voice.
  4. Language Support: The system boasts extensive multilingual capabilities, supporting over 1158 languages and dialects. Language selection is typically done using ISO-639-3 codes (e.g., eng for English, ita for Italian), though ISO-639-1 two-letter codes are also supported. The chosen TTS engine must be compatible with the selected language.
  5. Structured Markup Language (SML) Tags: For fine-grained control over the synthesized audio, the project supports custom SML tags embedded directly in the e-book text:
    • [break]: Inserts a silence of random duration between 0.3 and 0.6 seconds.
    • [pause]: Inserts a silence of random duration between 1.0 and 1.6 seconds.
    • [pause:N]: Inserts a fixed pause of N seconds, where N is a numerical value.
    • [voice:/path/to/voice/file]...[/voice]: Allows dynamic voice switching within the audiobook. Text enclosed in these tags is spoken using the voice from the specified audio file, overriding the default or selected cloning voice for that segment.
  6. Custom Model Integration and Fine-Tuning: Users can integrate their own pre-trained TTS models. For XTTSv2, a custom model is provided as a .zip file containing essential files such as config.json, model.pth, vocab.json, and a ref.wav (reference audio for the custom voice). The platform also supports fine-tuned models for enhanced quality or specific vocal characteristics.
  7. Output Generation: Converted audiobooks can be exported in various audio and container formats, including .m4b, .m4a, .mp4, .webm, .mov, .mp3, .flac, .wav, .ogg, and .aac. The .m4b format is particularly suitable for audiobooks because it supports chapters and metadata. Output can be generated in mono or stereo.
  8. Deployment and Usage: The project offers both a Gradio-based graphical user interface (GUI) for ease of use and a headless command-line interface (CLI) for automated conversions. It runs locally on Windows, macOS, and Linux, with cross-platform compatibility further enhanced by Docker and Podman containerization. These containerized environments support various hardware accelerators, allowing users to leverage their specific GPU (NVIDIA CUDA, AMD ROCm, Intel XPU, NVIDIA JETSON) for faster processing. The CLI lets users specify input e-book paths (--ebook or --ebooks_dir), the output directory (--output_dir), language, voice-cloning files, TTS engines, and custom models. Session management is available to resume interrupted conversions.
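The pause semantics of the SML tags described in point 5 can be sketched in Python. This is a minimal illustration of the documented durations, not the project's actual implementation; the tag grammar here is simplified to the three pause forms, and the function name pause_duration is hypothetical.

```python
import random
import re

# Simplified grammar for the pause-related SML tags:
#   [break]   -> random 0.3-0.6 s silence
#   [pause]   -> random 1.0-1.6 s silence
#   [pause:N] -> fixed N-second silence
TAG_RE = re.compile(r"\[(break|pause)(?::(\d+(?:\.\d+)?))?\]")

def pause_duration(tag: str, rng: random.Random) -> float:
    """Return the silence duration in seconds for one SML pause tag."""
    m = TAG_RE.fullmatch(tag)
    if m is None:
        raise ValueError(f"not an SML pause tag: {tag!r}")
    name, n = m.groups()
    if n is not None:                  # [pause:N] -> fixed duration
        return float(n)
    if name == "break":                # [break] -> short random silence
        return rng.uniform(0.3, 0.6)
    return rng.uniform(1.0, 1.6)       # [pause] -> longer random silence

rng = random.Random(0)
print(pause_duration("[pause:3]", rng))  # -> 3.0
```

A real renderer would scan the whole text for these tags (plus the [voice:...] pair) and splice the corresponding silence or voice segments into the audio stream.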