GitHub - altalt-org/Lightning-SimulWhisper: An MLX/CoreML implementation of SimulStreaming. ~15x increase in performance

altalt-org
2025.11.09
GitHub · by Anonymous
#MLX #CoreML #Whisper #Speech Recognition #Apple Silicon

Key Points

  1. Lightning-SimulWhisper is an MLX/CoreML implementation of SimulStreaming for real-time local transcription on Apple Silicon, offering substantial performance and power-efficiency improvements.
  2. It uses a hybrid architecture: a CoreML encoder for up to 18x faster encoding on the Neural Engine and an MLX decoder for up to 15x faster decoding, employing the AlignAtt policy for simultaneous speech recognition.
  3. This optimized design enables real-time execution of larger Whisper models (e.g., medium, large-v3-turbo) on Apple Silicon devices with significantly lower power consumption than MLX-only solutions.

Lightning-SimulWhisper is a high-performance, real-time local transcription system optimized for Apple Silicon devices, leveraging Apple's MLX machine learning framework and CoreML. It specifically implements the Whisper model for simultaneous speech recognition, adopting the AlignAtt policy for efficient streaming. The project distinguishes itself by eliminating PyTorch dependencies and achieving substantial speed and power efficiency improvements.
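The core idea of the AlignAtt policy is to use the decoder's cross-attention as a stopping signal: a candidate token is only emitted if the encoder frame it attends to most is far enough from the end of the audio received so far; otherwise the decoder waits for more input. The sketch below illustrates that halting test with NumPy; the function name, the averaged attention vector, and the frame threshold are illustrative assumptions, not the repository's actual API.

```python
import numpy as np

def alignatt_should_wait(cross_attention, num_frames, frame_threshold=2):
    """AlignAtt halting test (sketch): a candidate token is emitted only if
    the encoder frame it attends to most is NOT within the last
    `frame_threshold` frames of the audio available so far."""
    # cross_attention: per-frame attention weights for the candidate token,
    # assumed averaged over heads/layers for this sketch.
    most_attended = int(np.argmax(cross_attention))
    return most_attended >= num_frames - frame_threshold

# Token attending mostly to an early frame -> safe to emit now.
attn = np.array([0.1, 0.6, 0.2, 0.05, 0.05])
print(alignatt_should_wait(attn, num_frames=5))   # False -> emit

# Token attending mostly to the newest frames -> wait for more audio.
attn = np.array([0.05, 0.05, 0.1, 0.2, 0.6])
print(alignatt_should_wait(attn, num_frames=5))   # True -> wait
```

Because the check needs only the argmax of one attention vector per decoding step, it adds negligible cost on top of the decoder forward pass.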

The core methodology employs a hybrid architectural approach to optimize different stages of the Whisper pipeline:

  1. Audio Input Processing: The initial audio input (16kHz mono) is first processed by MLX to generate the Mel Spectrogram.
  2. CoreML Encoder Acceleration: This is the pivotal optimization. The encoder component of the Whisper model is offloaded to CoreML, utilizing the whisper.cpp integration to harness Apple's Neural Engine (ANE). This dramatically accelerates the most computationally intensive part of the model, yielding up to an 18x speedup for encoding compared to traditional implementations. The output encoder features are then converted from CoreML's format back into MLX tensors.
  3. MLX Decoder Implementation: The decoding phase runs entirely on MLX. This allows for flexible and fine-grained control over the decoding process, including the implementation of the AlignAtt policy, which is a state-of-the-art strategy for simultaneous decoding in real-time. The MLX decoder demonstrates up to a 15x speedup compared to PyTorch-based implementations, further contributing to overall performance.
  4. Transcription Output: The decoded tokens are then assembled into the final transcription.
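The four-stage data flow above can be sketched as a short pipeline. Everything here is a stand-in: the function names, shapes, and stubbed outputs are illustrative assumptions (the real project calls into MLX and a CoreML-compiled encoder), but the handoff order matches the steps described.

```python
import numpy as np

# Hypothetical stand-ins for the real components; names and shapes are
# illustrative, not the repository's actual API.
N_MELS, D_MODEL = 80, 512

def mlx_mel_spectrogram(audio_16k_mono):
    # Step 1: MLX computes the log-Mel spectrogram (stubbed with zeros here).
    n_frames = len(audio_16k_mono) // 160          # 10 ms hop at 16 kHz
    return np.zeros((N_MELS, n_frames), dtype=np.float32)

def coreml_encode(mel):
    # Step 2: the encoder runs on the Neural Engine via CoreML; its output
    # comes back as a plain array (Whisper downsamples frames by 2)...
    return np.zeros((mel.shape[1] // 2, D_MODEL), dtype=np.float32)

def to_mlx(array):
    # ...and is converted into an MLX tensor (identity stand-in here).
    return array

def mlx_decode_alignatt(encoder_features):
    # Step 3: MLX decoder guided by the AlignAtt policy (stubbed token ids).
    return [50258, 50259, 50359]

audio = np.zeros(16000, dtype=np.float32)          # 1 s of silence
mel = mlx_mel_spectrogram(audio)
features = to_mlx(coreml_encode(mel))
tokens = mlx_decode_alignatt(features)             # Step 4: tokens -> text
print(mel.shape, features.shape, len(tokens))
```

The split mirrors the design trade-off: the fixed, compute-heavy encoder is a natural fit for the ahead-of-time-compiled CoreML path, while the decoder stays in MLX where per-step control (beam search, AlignAtt halting) is easy to express.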

This hybrid architecture, where the compute-heavy encoder utilizes the Neural Engine via CoreML for unparalleled speed and power efficiency, and the flexible decoder leverages MLX's native Apple Silicon optimizations, enables real-time transcription even with larger Whisper models (e.g., medium, large-v3-turbo) on devices like the M2 MacBook Pro. The use of CoreML for encoding significantly reduces power consumption compared to MLX-only implementations.

Key features include:

  • Native Apple Silicon optimization with MLX and CoreML.
  • Up to 18x encoder speedup via CoreML and Apple Neural Engine.
  • Up to 15x decoder speedup via MLX.
  • Implementation of the AlignAtt policy for simultaneous decoding.
  • Support for various Whisper model sizes (tiny, base, small, medium, large-v1/v2/v3).
  • Configurable beam search decoding (e.g., --beams 3).
  • Real-time streaming capabilities from both audio files (simulation) and live microphone input.
  • Integration with CIF (Continuous Integrate and Fire) models for improved word boundary detection (e.g., --cif_ckpt_path cif_model/medium.npz).
  • Optional Voice Activity Detection (VAD) using Silero (requires torchaudio).
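The CIF models mentioned above follow a simple mechanism that can be sketched independently of the project: each encoder frame carries a weight, the weights are accumulated, and a boundary "fires" whenever the running sum crosses a threshold. The function below is a minimal illustration of that idea with made-up weights, not the repository's implementation.

```python
import numpy as np

def cif_boundaries(alphas, threshold=1.0):
    """Continuous Integrate-and-Fire (sketch): accumulate per-frame weights
    and fire a word/token boundary each time the sum crosses the threshold."""
    boundaries, acc = [], 0.0
    for i, a in enumerate(alphas):
        acc += a
        if acc >= threshold:
            boundaries.append(i)       # frame index where a boundary fires
            acc -= threshold           # carry the remainder forward
    return boundaries

# Made-up per-frame weights; boundaries fire at frames 2, 4, and 6.
alphas = np.array([0.3, 0.4, 0.5, 0.2, 0.7, 0.1, 0.9])
print(cif_boundaries(alphas))          # [2, 4, 6]
```

Sharper boundary estimates like these let a streaming transcriber commit words earlier without cutting them mid-syllable.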