Research & Papers

Haiku to Opus in Just 10 Bits: LLMs Unlock Massive Compression Gains

New 'Question-Asking' protocol transfers knowledge between AI models using only 10 binary questions, achieving over 100x better compression than prior LLM-based methods.

Deep Dive

A team of researchers including Roy Rinberg and Nicholas Carlini has published a groundbreaking paper demonstrating how large language models (LLMs) can achieve unprecedented compression ratios through interactive protocols. Their work introduces a 'compression-compute frontier,' showing that additional computation can be traded for additional compression. For lossless compression, they found that domain-adapted LoRA adapters can double compression efficiency over a base LLM alone. For lossy compression, simply prompting a model for a succinct rewrite before applying arithmetic coding achieves compression ratios around 0.03, another 2x improvement.
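
The arithmetic-coding step here is standard source coding: given a probability model shared by sender and receiver, an ideal arithmetic coder spends about -log2 p(symbol) bits per symbol, so a sharper model and a shorter rewrite both shrink the bitstream. The sketch below is a minimal, hypothetical illustration of that accounting; the unigram character model and the 'succinct rewrite' string are toy stand-ins for the LLM's next-token distribution and its prompted rewrite, not the paper's implementation.

```python
import math
from collections import Counter

def ideal_code_length_bits(text: str, probs: dict[str, float]) -> float:
    # An ideal arithmetic coder emits -log2 p(symbol) bits per symbol,
    # so total length approaches the model's log-loss on the text.
    return sum(-math.log2(probs[ch]) for ch in text)

def unigram_model(text: str) -> dict[str, float]:
    # Toy stand-in for an LLM: a unigram character model fit on the text.
    # An LLM's next-token distribution is far sharper, hence shorter codes.
    counts = Counter(text)
    return {ch: c / len(text) for ch, c in counts.items()}

original = "The quick brown fox jumps over the lazy dog. " * 20
rewrite = "Quick fox jumps over lazy dog."  # hypothetical 'succinct rewrite'

raw_bits = 8 * len(original)  # 8 bits/char uncompressed baseline
for label, text in (("lossless (original)", original),
                    ("lossy (rewrite)", rewrite)):
    coded = ideal_code_length_bits(text, unigram_model(text))
    print(f"{label:>20}: {coded:7.0f} bits, ratio {coded / raw_bits:.4f}")
```

Swapping the toy unigram model for an LLM's predictions, and the hand-written rewrite for a prompted one, is what drives the ratios reported above.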

The most significant innovation is the 'Question-Asking' (QA) compression protocol, an interactive lossy method inspired by the game 'Twenty Questions.' In this setup, a smaller, cheaper model (like Anthropic's Claude Haiku) iteratively refines its response by asking a series of yes/no questions to a much larger, more capable model (like Claude Opus). Each answer transfers exactly one bit of information. Remarkably, using just 10 binary questions (10 bits total) allowed the small model to recover 23% to 72% of the capability gap on standard benchmarks and 7% to 38% on harder ones.
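
A minimal sketch of how such a protocol could be wired up, assuming hypothetical small_model and large_model_yes_no callables standing in for API calls to models like Haiku and Opus (the paper's exact prompts and bookkeeping may differ):

```python
from typing import Callable

def qa_protocol(task: str,
                small_model: Callable[[str], str],
                large_model_yes_no: Callable[[str], bool],
                num_questions: int = 10) -> str:
    # Hypothetical sketch of the Question-Asking (QA) protocol: the small
    # model drafts each yes/no question itself, the large model returns a
    # single bit, and the small model conditions its final answer on the
    # accumulated transcript of question/answer pairs.
    transcript = f"Task: {task}\n"
    for i in range(num_questions):
        question = small_model(
            transcript
            + "Ask one yes/no question whose answer would most improve "
              "your response to the task.")
        bit = large_model_yes_no(
            f"Task: {task}\nQuestion: {question}\nAnswer strictly yes or no.")
        transcript += f"Q{i + 1}: {question} A: {'yes' if bit else 'no'}\n"
    return small_model(transcript + "Now give your best final answer.")
```

Because the questions are generated on the small model's side, the only information flowing back from the large model is the one-bit answers, which is why the transmitted payload is exactly num_questions bits.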

This method achieved astonishing compression ratios between 0.0006 and 0.004, an improvement of over 100x compared to prior LLM-based compression techniques such as that of Delétang et al. (2024). The research suggests that future AI systems could communicate complex knowledge and coordinate tasks not by sending lengthy, token-heavy responses, but through ultra-efficient, bit-level interactive dialogues. This has profound implications for reducing bandwidth costs, improving latency in distributed AI systems, and enabling collaboration between models of vastly different sizes and capabilities.
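
As a sanity check on those numbers (our arithmetic, not a figure from the paper), a 10-bit payload at the reported ratios corresponds to reference responses of roughly 300 bytes to 2 KB:

```python
# Illustrative back-of-envelope only: the reference response sizes that
# a 10-bit payload would imply at the reported compression ratios.
payload_bits = 10
for ratio in (0.004, 0.0006):
    original_bits = payload_bits / ratio
    print(f"ratio {ratio:g}: ~{original_bits:,.0f} bits "
          f"(~{original_bits / 8:,.0f} bytes) of reference response")
```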

Key Points
  • The 'Question-Asking' (QA) protocol uses just 10 yes/no questions (10 bits) to transfer knowledge from a large model (Opus) to a small one (Haiku).
  • This method recovers 23-72% of the performance gap on benchmarks and achieves compression ratios as low as 0.0006, a 100x improvement over prior art.
  • The work establishes a 'compression-compute frontier,' proving interactive protocols are far more efficient for knowledge transfer than transmitting full model responses.

Why It Matters

This could drastically reduce the cost and latency of AI coordination, enabling efficient collaboration between massive cloud models and lightweight edge devices.