GitHub - cjpais/Handy: A free, open source, and extensible speech-to-text application that works completely offline.
Key Points
- Handy is a free, open-source, and extensible desktop application that provides privacy-focused, offline speech-to-text transcription by converting spoken words into text in any field.
- Built with Tauri (Rust + React), it processes audio locally using Voice Activity Detection and Whisper or Parakeet models, pasting transcribed text via configurable keyboard shortcuts.
- Available on Windows, macOS, and Linux, Handy emphasizes local processing, community contribution, and offers detailed guides for installation, development, and troubleshooting, including manual model setup.
Handy is a free, open-source, and extensible desktop application for offline speech-to-text transcription, built with Tauri (Rust and React/TypeScript). Its primary purpose is to provide a privacy-focused solution that keeps all voice processing local to the user's machine, eliminating the need to send audio to cloud services. The project emphasizes being easily "forkable" due to its simple, well-patterned codebase.
The core methodology involves a client-side, local processing pipeline:
- Audio Capture and Activation: The user initiates recording by pressing a configurable global keyboard shortcut (or using push-to-talk mode). The application leverages the rdev Rust library for global keyboard shortcuts and system event handling.
- Voice Activity Detection (VAD): As audio is captured, silence is dynamically filtered out using Voice Activity Detection based on the Silero VAD model, implemented via the vad-rs library. This ensures that only speech segments are processed, optimizing resource usage and transcription accuracy. The cpal library handles cross-platform audio input/output, and rubato is used for audio resampling.
- Speech Recognition: Once recording stops (upon releasing the shortcut), the captured speech is transcribed using one of the user-selected, locally stored models. Handy supports two primary model families:
  - Whisper Models: These models (Small, Medium, Turbo, Large variants) are implemented using whisper-rs (a Rust binding for whisper.cpp) and leverage GPU acceleration when available on compatible hardware (macOS M-series, Intel Mac, Windows/Linux with Intel, AMD, or NVIDIA GPUs).
  - Parakeet V3 Model: This is a CPU-optimized model, implemented via transcription-rs, known for excellent performance (approximately 5x real-time speed on mid-range i5 processors) and automatic language detection.
- Text Output: The transcribed text is then automatically pasted directly into the active text field of the application the user is currently using. This process is managed by system event libraries. On Linux, specific tools like xdotool (for X11) or wtype/dotool (for Wayland) are often required for reliable text input.
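The silence-filtering step in the pipeline above can be illustrated with a small, dependency-free sketch. It is not Handy's implementation: Handy uses the Silero VAD model through vad-rs, whereas this stand-in swaps in a plain RMS energy threshold so the idea can be run with the standard library alone. The function names (`frame_rms`, `drop_silence`), the frame length, and the threshold are all hypothetical choices made for illustration.

```rust
/// Root-mean-square energy of one frame of mono samples in [-1.0, 1.0].
fn frame_rms(frame: &[f32]) -> f32 {
    if frame.is_empty() {
        return 0.0;
    }
    let sum_sq: f32 = frame.iter().map(|s| s * s).sum();
    (sum_sq / frame.len() as f32).sqrt()
}

/// Split `samples` into frames of `frame_len` and keep only the frames whose
/// RMS energy reaches `threshold`.
///
/// Assumption: an energy threshold stands in for a model-based VAD such as
/// Silero. A real VAD is far more robust to background noise.
fn drop_silence(samples: &[f32], frame_len: usize, threshold: f32) -> Vec<f32> {
    assert!(frame_len > 0, "frame_len must be non-zero");
    samples
        .chunks(frame_len)                              // fixed-size frames (last may be shorter)
        .filter(|frame| frame_rms(frame) >= threshold)  // discard quiet frames
        .flatten()
        .copied()
        .collect()
}
```

Calling `drop_silence(&audio, 480, 0.02)` on 16 kHz mono audio would gate 30 ms frames. Only the retained samples would then be handed to the speech recognition step, which is the resource saving the VAD stage is meant to provide.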
The application's architecture separates the user interface (Frontend: React + TypeScript with Tailwind CSS) from the core logic (Backend: Rust, handling system integration, audio processing, and ML inference).
Handy supports macOS (Intel and Apple Silicon), x64 Windows, and x64 Linux.
Known limitations include potential Whisper model crashes on certain Windows/Linux configurations and limited support for the Wayland display server on Linux, where external text input utilities are often required. On Wayland, global keyboard shortcuts must be configured manually in the desktop environment or window manager, which can then trigger Handy through its CLI flags (e.g., handy --toggle-transcription) or Unix signals (SIGUSR1, SIGUSR2).
Users can install models manually if automatic downloads are blocked by network restrictions. Handy also supports custom Whisper GGML models: place them in the application's models directory, and they are auto-discovered and made available in settings.
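Auto-discovery of this kind usually amounts to scanning a directory for files that look like models. The sketch below shows one way that could work. It is an assumption-laden illustration rather than Handy's actual code: the function name `discover_models` is hypothetical, and it assumes custom GGML models are single files with a `.bin` extension.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Return the paths of candidate model files in `models_dir`, sorted by path.
///
/// Assumption: a custom model is a regular file whose extension is `.bin`
/// (compared case-insensitively). Subdirectories and other files are ignored.
fn discover_models(models_dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut found = Vec::new();
    for entry in fs::read_dir(models_dir)? {
        let path = entry?.path();
        let is_bin = path
            .extension()
            .map_or(false, |ext| ext.eq_ignore_ascii_case("bin"));
        if path.is_file() && is_bin {
            found.push(path);
        }
    }
    found.sort(); // stable ordering for display in a settings list
    Ok(found)
}
```

A settings screen could call this at startup, or whenever the model list is opened, and list each returned file by name. Validating that a file really is a loadable GGML model is left out here and would be needed before using it.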