Research & Papers

Two-dimensional early exit optimisation of LLM inference

A novel method coordinates sentence-by-sentence and layer-by-layer exiting for multiplicative computational savings.

Deep Dive

A team of researchers including Jan Hůla and David Adamczyk has published a paper introducing a novel "two-dimensional early exit" strategy designed to drastically speed up large language model (LLM) inference for classification tasks. Unlike traditional early exit methods, which only decide when to stop processing through a model's layers, this approach adds a second dimension: it processes the input incrementally, sentence by sentence. By coordinating when to exit both across layers and across the sequence of sentences, the method achieves multiplicative computational savings, outperforming optimizations that focus on just one dimension.

Experiments covered four state-of-the-art LLMs (Llama 3.1, Llama 3.2, Gemma, and Qwen, at 3B to 8B parameters) across three sentiment classification datasets. On simpler tasks, the 2D strategy delivered additional speed-ups of 1.4x to 2.3x over an optimal layer-wise early exit baseline. The approach remains effective even with fine-tuned models and is designed to be model-agnostic, requiring only lightweight classification adapters. Critically, it is orthogonal to other efficiency techniques like quantization and pruning, meaning it can be combined with them for even greater gains.

The findings indicate this strategy excels in scenarios where semantic information accumulates predictably across the structure of the input, such as in sentiment analysis where later sentences often reinforce or clarify earlier ones. This suggests potential applicability to a broader range of sequence-processing tasks beyond classification, where understanding builds progressively. The work represents a significant step towards more adaptive and efficient inference, reducing the computational cost of running powerful LLMs without sacrificing accuracy for suitable tasks.

Key Points
  • Achieves 1.4x to 2.3x speed-up over optimal layer-wise early exit on models like Llama 3.1 and Gemma.
  • Coordinates sentence-wise and layer-wise exiting for multiplicative savings, processing text incrementally.
  • Model-agnostic and works with lightweight adapters, orthogonal to quantization and pruning for combined efficiency gains.
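The "multiplicative savings" in the points above follow from simple arithmetic: a layer-wise speed-up and a sentence-wise speed-up compound rather than add. The numbers below are purely illustrative, not results from the paper.

```python
# Hypothetical cost accounting in "layer passes" for one input:
full_cost = 32 * 10            # 32 layers x 10 sentences, no early exit
layerwise = full_cost / 2.0    # suppose layer-wise exit alone halves cost
two_d = layerwise / 1.8        # suppose sentence-wise exit adds 1.8x on top

combined_speedup = full_cost / two_d
print(combined_speedup)        # 2.0 * 1.8 = 3.6x, i.e. multiplicative
```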

Why It Matters

Dramatically reduces the cost and latency of using LLMs for real-time classification and analysis tasks in production.