ALAR cuts LLM agent tokens by up to 84.6% while keeping accuracy
New method slashes reasoning tokens by up to 84.6% without losing task performance.
A team of researchers (Dongwon Jung, Peng Shi, Yi Zhang, Junshan Zhang, Muhao Chen) from multiple institutions has proposed Adaptive Latent Agentic Reasoning (ALAR), a framework that addresses a key inefficiency in current LLM agents: they generate verbose textual reasoning at every decision step, wasting tokens on routine actions. ALAR operates in two modes. It uses compact, non-textual latent reasoning for straightforward turns and escalates to explicit chain-of-thought (CoT) only when the agent encounters a harder decision. The latent reasoning is learned by using the agent's actions as supervision anchors, optimizing the model to keep reasoning internal when that suffices for success.
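The dual-mode idea can be illustrated with a toy router. This is only a sketch under stated assumptions, not the authors' method: the summary does not say how ALAR decides when to escalate, so the per-turn `difficulty` score, the fixed `threshold`, and the flat `cot_tokens` cost are all hypothetical stand-ins, and latent turns are modeled as emitting zero textual reasoning tokens.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TurnResult:
    mode: str              # "latent" or "explicit"
    reasoning_tokens: int  # textual reasoning tokens emitted on this turn


def route_turn(difficulty: float, threshold: float = 0.5,
               cot_tokens: int = 200) -> TurnResult:
    """Choose a reasoning mode for one agent turn.

    Hypothetical rule: escalate to explicit CoT when the difficulty
    score reaches the threshold, otherwise stay latent.
    """
    if difficulty >= threshold:
        return TurnResult("explicit", cot_tokens)
    # Latent reasoning emits no textual reasoning tokens in this toy model.
    return TurnResult("latent", 0)


def run_episode(difficulties: List[float], threshold: float = 0.5,
                cot_tokens: int = 200) -> List[TurnResult]:
    """Route every turn of a multi-turn episode."""
    return [route_turn(d, threshold, cot_tokens) for d in difficulties]


if __name__ == "__main__":
    # Made-up difficulty scores for a five-turn episode.
    results = run_episode([0.1, 0.2, 0.9, 0.3, 0.7])
    adaptive = sum(r.reasoning_tokens for r in results)
    uniform = 200 * len(results)  # uniform CoT on every turn
    print([r.mode for r in results])
    print(f"adaptive={adaptive} uniform={uniform}")
```

In this toy episode only two of five turns pay the textual CoT cost. In a real system the escalation decision would come from the trained model rather than a hand-set threshold.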
In experiments on agentic search and tool-use benchmarks, ALAR achieved comparable or better task accuracy while sharply reducing token consumption: up to 43.6% fewer tokens in search tasks and up to 84.6% fewer in tool-use scenarios. Agents can therefore reason with far fewer generated tokens, which lowers cost and latency, without sacrificing quality. The work points to a smarter approach to agentic reasoning, one that adapts the amount of computation per turn rather than applying uniform CoT depth, and it could make LLM-powered agents more practical for real-world, multi-turn interactions where latency and cost matter.
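To put the reported percentages in perspective, a reduction of p percent leaves (100 - p) percent of the tokens, which is a shrink factor of 1 / (1 - p/100). The short sketch below does that arithmetic. The function names and the 1,000-token baseline are illustrative assumptions, not figures from the paper.

```python
def shrink_factor(reduction_pct: float) -> float:
    """Convert a percentage token reduction into a 'times fewer tokens' factor."""
    if not 0 <= reduction_pct < 100:
        raise ValueError("reduction_pct must be in [0, 100)")
    return 1.0 / (1.0 - reduction_pct / 100.0)


def tokens_after(baseline_tokens: int, reduction_pct: float) -> float:
    """Tokens remaining after applying a percentage reduction to a baseline."""
    return baseline_tokens * (1.0 - reduction_pct / 100.0)


if __name__ == "__main__":
    baseline = 1000  # hypothetical baseline token count, for illustration only
    for label, pct in [("agentic search", 43.6), ("tool use", 84.6)]:
        print(f"{label}: {tokens_after(baseline, pct):.0f} tokens left, "
              f"{shrink_factor(pct):.2f}x fewer")
```

At the reported upper bounds, this works out to roughly 1.8x fewer tokens on agentic search and roughly 6.5x fewer on tool use.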
- ALAR uses a dual-mode framework: latent reasoning for routine turns, explicit CoT for hard decisions.
- Token reduction of up to 43.6% on agentic search and 84.6% on tool-use benchmarks.
- Maintains or improves task accuracy while cutting unnecessary textual reasoning.
Why It Matters
ALAR makes LLM agents faster and cheaper by skipping verbose textual reasoning on routine steps and reserving explicit CoT for hard decisions.