Research & Papers

Cross-Family Speculative Decoding for Polish Language Models on Apple Silicon: An Empirical Evaluation of Bielik 11B with UAG-Extended MLX-LM

New technique enables mismatched Polish LLMs to run faster on Apple Silicon, achieving up to 1.7x speedup.

Deep Dive

Researcher Krzysztof Fonal has published a paper demonstrating a novel method to accelerate Polish-language AI models on Apple Silicon Macs. The research extends the MLX-LM framework with Universal Assisted Generation (UAG), enabling speculative decoding between models with different tokenizers—a technique previously limited to models from the same family. This "cross-family" approach allows a small, fast "draft" model to propose tokens for a larger, more accurate "target" model to verify, significantly speeding up text generation.
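The draft-then-verify loop described above can be sketched in a few lines. This is a minimal toy illustration, not MLX-LM code: `draft_next` and `target_next` are stand-in functions (names invented here) that play the roles of the small draft model and the large target model.

```python
def draft_next(ctx):
    # Toy "draft model": a cheap guess at the next token.
    return (ctx[-1] + 1) % 50

def target_next(ctx):
    # Toy "target model": the ground truth the output must match.
    return (ctx[-1] + 1) % 50 if ctx[-1] % 7 else (ctx[-1] + 2) % 50

def speculative_step(ctx, draft_len=4):
    """Draft `draft_len` tokens sequentially, then verify against the target.

    The accepted prefix is kept; at the first mismatch the target's own
    token is substituted, so the output is identical to pure target
    decoding -- speculation changes speed, never content.
    """
    # Phase 1: the draft model proposes a short continuation.
    proposal, tmp = [], list(ctx)
    for _ in range(draft_len):
        tok = draft_next(tmp)
        proposal.append(tok)
        tmp.append(tok)

    # Phase 2: the target model checks the proposal (one batched
    # forward pass in a real implementation).
    accepted, tmp = [], list(ctx)
    for tok in proposal:
        want = target_next(tmp)
        if tok == want:
            accepted.append(tok)
            tmp.append(tok)
        else:
            accepted.append(want)  # replace the first rejected token
            break
    return accepted
```

Because every accepted token is checked against the target, a run of `speculative_step` calls reproduces exactly what the target model alone would generate; the win is that several tokens can be confirmed per expensive target pass.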

The study evaluated the Polish-specialized Bielik 11B-Instruct model (based on Mistral architecture) paired with three different draft models: Bielik 1.5B, Qwen2.5-1.5B, and Llama 3.2-1B. Using three Polish datasets, researchers tested draft lengths of 2, 4, and 6 tokens with both naive and context-aware token translation methods. Results showed context-aware translation consistently improved acceptance rates, though surprisingly, the Polish-optimized Bielik 1.5B draft model performed worse than general-purpose Qwen2.5 and Llama 3.2 models.
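The naive and context-aware translation methods differ in how draft tokens are mapped into the target model's vocabulary. A common approach, sketched here with toy tokenizers (character-level draft, whitespace-word target; these are illustrative stand-ins, not the models' real tokenizers), is to decode draft tokens to text and re-encode; the context-aware variant re-encodes the surrounding context together with the draft text so that merges at the boundary are handled.

```python
def draft_decode(ids):
    # Toy draft tokenizer: one token per character code.
    return "".join(chr(i) for i in ids)

def target_encode(text):
    # Toy target tokenizer: one token per whitespace-separated word.
    return text.split()

def translate_naive(draft_ids, draft_decode, target_encode):
    """Naive translation: decode the draft tokens alone, re-encode them
    with the target tokenizer, ignoring the preceding context."""
    return target_encode(draft_decode(draft_ids))

def translate_context_aware(context_text, draft_ids, draft_decode, target_encode):
    """Context-aware translation: re-encode context + draft text together
    and return only the new suffix, so tokens that straddle the
    context/draft boundary are merged correctly."""
    full = target_encode(context_text + draft_decode(draft_ids))
    prefix = target_encode(context_text)
    return full[len(prefix):]
```

When the draft text begins mid-word, the naive method produces a spurious boundary token that the target model is unlikely to accept; re-encoding with the context avoids this, which is consistent with the higher acceptance rates the study reports for the context-aware method.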

Performance on Apple Silicon's unified memory architecture revealed important hardware constraints. While the technique achieved up to 1.7x speedup for structured text like Wikipedia articles, it failed to accelerate varied instruction-based content. The research found that the expected amortization breaks down because both models are memory-bandwidth bound on unified memory systems: the draft model's sequential forward passes remain expensive relative to the target's single batched verification pass. The paper proposes a new hardware-aware speedup formula to characterize when cross-family speculative decoding is beneficial on consumer hardware.
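The paper's exact formula is not reproduced here, but the standard textbook speedup estimate for speculative decoding (in the style of Leviathan et al.) already shows the effect the authors describe: the benefit collapses as the draft-to-target cost ratio rises, which is exactly what happens when both models are bandwidth-bound rather than compute-bound.

```python
def expected_speedup(alpha, gamma, c_ratio):
    """Classic speculative-decoding speedup estimate (not the paper's
    hardware-aware formula; a standard baseline from the literature).

    alpha   -- per-token acceptance probability of draft proposals
    gamma   -- draft length (tokens proposed per verification step)
    c_ratio -- cost of one draft step / cost of one target step
    """
    # Expected tokens confirmed per verification step (geometric sum).
    exp_tokens = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    # Cost of one step: gamma sequential draft passes + one batched verify.
    step_cost = gamma * c_ratio + 1.0
    return exp_tokens / step_cost
```

With illustrative numbers (alpha=0.8, gamma=4), a compute-bound cost ratio of 0.05 yields roughly 2.8x, while a bandwidth-bound ratio of 0.5 yields only about 1.1x, mirroring the study's finding that drafting stops paying off on unified memory.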

Key Points
  • UAG extension to MLX-LM enables cross-tokenizer speculative decoding, allowing mismatched Polish LLMs to work together
  • Achieved up to 1.7x speedup on structured Polish text but failed on varied instructions due to memory bandwidth limits
  • Context-aware token translation improved acceptance rates, but Polish-specialized draft models underperformed general-purpose alternatives

Why It Matters

Enables faster Polish AI applications on consumer Macs and provides a framework for optimizing LLMs on unified memory architectures.