Research & Papers

[P] On-device speech toolkit for Apple Silicon — ASR, TTS, diarization, speech-to-speech, all in native Swift

An open-source Swift package combines MLX and CoreML to run speech AI 10x faster than WhisperKit.

Deep Dive

Developer Ivan Digital has released an open-source Swift speech toolkit optimized for Apple Silicon that runs 11 speech AI models entirely on-device, with no cloud dependencies. The toolkit uses a dual-engine architecture that allocates work between MLX, which handles large models on the GPU, and CoreML, which runs smaller models on the Neural Engine. This design sidesteps a bottleneck in alternatives like WhisperKit, where CoreML audio encoding blocks the Neural Engine for 300-600ms per call. Because the two engines operate independently, Voice Activity Detection (VAD) can run on the Neural Engine while Automatic Speech Recognition (ASR) runs concurrently on the GPU, achieving real-time factors as low as 0.06 for transcription.
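The overlap described above maps naturally onto Swift structured concurrency. The sketch below is illustrative only: the protocol and type names (`VoiceActivityDetector`, `SpeechRecognizer`, `AudioChunk`) are assumptions, not the toolkit's actual API. The point is that two independent async engines can be awaited together, so the ANE-bound VAD call never serializes behind the GPU-bound ASR call.

```swift
import Foundation

// Hypothetical protocols standing in for the toolkit's real interfaces.
struct AudioChunk { let samples: [Float] }

protocol VoiceActivityDetector {
    // Imagined to be backed by a small CoreML model on the Neural Engine.
    func isSpeech(_ chunk: AudioChunk) async -> Bool
}

protocol SpeechRecognizer {
    // Imagined to be backed by an MLX model on the GPU.
    func transcribe(_ chunk: AudioChunk) async -> String
}

func process(_ chunk: AudioChunk,
             vad: some VoiceActivityDetector,
             asr: some SpeechRecognizer) async -> String? {
    // `async let` starts both calls concurrently: the VAD check does not
    // wait for the recognizer, and vice versa, unlike a single CoreML
    // graph that occupies the Neural Engine for the whole encode.
    async let speech = vad.isSpeech(chunk)
    async let text = asr.transcribe(chunk)
    return await speech ? await text : nil
}
```

Dropping the transcript when VAD reports silence is one plausible policy; a real pipeline might instead skip the ASR call entirely for silent chunks.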

The model lineup includes Qwen3-ASR (0.6B and 1.7B parameters at 4-bit quantization) for transcription, PersonaPlex 7B for full-duplex speech-to-speech conversation, CosyVoice3 for text-to-speech with 120ms first-chunk latency, and specialized models for diarization, noise suppression, and forced alignment. All models conform to shared protocols, so developers can swap implementations and compose custom pipelines. Current development focuses on a MeetingTranscriber pipeline for diarized transcription and on real-time streaming diarization. For Apple-ecosystem developers building privacy-focused applications, the toolkit is a notable step forward in on-device speech processing efficiency.
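Protocol-based composition of this kind can be sketched as follows. All names here (`Transcriber`, `Diarizer`, `MeetingPipeline`, `Segment`) are hypothetical, not the toolkit's actual types; the sketch only shows why shared protocols make models swappable: upgrading from a 0.6B to a 1.7B recognizer changes which instance is injected, not the pipeline code.

```swift
import Foundation

// Hypothetical protocols -- any conforming model can be slotted in.
protocol Transcriber {
    func transcribe(_ audio: [Float]) -> String
}

protocol Diarizer {
    func speaker(for audio: [Float]) -> String
}

struct Segment {
    let speaker: String
    let text: String
}

// A toy stand-in for something like the MeetingTranscriber pipeline:
// it depends only on the protocols, never on concrete model types.
struct MeetingPipeline {
    let transcriber: Transcriber
    let diarizer: Diarizer

    func run(on audio: [Float]) -> Segment {
        Segment(speaker: diarizer.speaker(for: audio),
                text: transcriber.transcribe(audio))
    }
}
```

Swapping implementations then means constructing `MeetingPipeline` with a different conforming model, with no changes to downstream code.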

Key Points
  • Runs 11 speech models locally using MLX (GPU) and CoreML (Neural Engine) with strategic task allocation
  • Achieves RTF ~0.06 for ASR on M2 Max and enables full-duplex speech-to-speech with PersonaPlex 7B
  • Solves WhisperKit's ANE blocking issue by running VAD on Neural Engine while ASR processes on GPU
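For readers unfamiliar with the metric, real-time factor is simply processing time divided by audio duration, so smaller is faster. A quick sketch (the function name is ours, not the toolkit's):

```swift
// Real-time factor: processing time divided by audio duration.
// RTF 0.06 means one minute of audio is transcribed in ~3.6 seconds.
func realTimeFactor(processingSeconds: Double, audioSeconds: Double) -> Double {
    processingSeconds / audioSeconds
}

let rtf = realTimeFactor(processingSeconds: 3.6, audioSeconds: 60.0)
print(rtf)  // ~0.06
```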

Why It Matters

Enables developers to build fully private, real-time speech applications for Apple devices without cloud latency or data privacy concerns.