ProxyCoT: New training method boosts LLM reasoning on 10M-token inputs

arXiv cs.CL May 21, 2026

⚡ Train on short snippets, reason over entire 10M-token documents, with less compute than full-context fine-tuning.

Deep Dive

Recent large language models can process inputs of up to 10 million tokens, yet they struggle with complex reasoning tasks that span those long contexts. The key insight behind ProxyCoT is that many such tasks can be solved using only a subset of the input, a proxy context, which calls for the same underlying reasoning steps as the full input. Even so, models show a performance gap: they reason well on the short proxy but poorly on the full long context. To bridge that gap, the team first obtains high-quality chain-of-thought traces on proxy contexts using either reinforcement learning or distillation from a larger teacher model. They then ground those traces in the full long context through supervised fine-tuning. The result is a model that keeps the reasoning quality it reached on the proxy while learning to apply it across the entire input.
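The two stages described above can be sketched as a data-preparation step. This is a minimal illustration, not the authors' implementation: the `Example` and `SFTPair` types, the `build_sft_pairs` helper, the answer-matching filter, and the toy trace generator are all hypothetical names and choices made for this sketch. In practice the trace generator would be an RL-trained model or a larger teacher model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    full_context: str   # the entire long input
    proxy_context: str  # short subset assumed sufficient to answer the question
    question: str
    answer: str         # gold answer, used here only to filter traces


@dataclass
class SFTPair:
    prompt: str  # full context plus question
    target: str  # reasoning trace that was produced on the proxy context


def build_sft_pairs(
    examples: List[Example],
    generate_trace: Callable[[str, str], str],
) -> List[SFTPair]:
    """Stage 1: produce a trace from the proxy context.
    Stage 2: pair that trace with the full context as a fine-tuning target."""
    pairs = []
    for ex in examples:
        trace = generate_trace(ex.proxy_context, ex.question)
        # Assumed quality filter (not from the paper): keep only traces
        # that end with the gold answer.
        if not trace.strip().endswith(ex.answer):
            continue
        prompt = f"{ex.full_context}\n\nQuestion: {ex.question}"
        pairs.append(SFTPair(prompt=prompt, target=trace))
    return pairs


def toy_generate_trace(proxy_context: str, question: str) -> str:
    """Hypothetical stand-in for an RL-trained or teacher model."""
    guess = proxy_context.rstrip(".").split(" is ")[-1]
    return f"The relevant passage states: {proxy_context} Therefore the answer is {guess}"


examples = [
    Example(
        full_context="Preamble. The contract term is 5 years. Appendix.",
        proxy_context="The contract term is 5 years.",
        question="How long is the contract term?",
        answer="5 years",
    ),
    Example(
        full_context="Preamble. Payment due within 30 days of invoice. Appendix.",
        proxy_context="Payment due within 30 days of invoice.",
        question="When is payment due?",
        answer="30 days",
    ),
]

pairs = build_sft_pairs(examples, toy_generate_trace)
print(len(pairs))  # the second example's toy trace fails the filter
```

The resulting pairs would then feed an ordinary supervised fine-tuning loop. The point of the sketch is only the shape of the data: the trace comes from the short context, while the training prompt contains the long one.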

Experiments across multiple datasets show ProxyCoT consistently outperforms existing baselines, including full-context fine-tuning and retrieval-augmented methods, while demanding fewer computational resources. The method also generalizes well to out-of-domain tasks, suggesting it learns robust reasoning patterns rather than dataset-specific shortcuts. Accepted at ACL 2026, this work addresses a critical bottleneck as context windows continue to grow. For professionals working with legal documents, scientific papers, or long codebases, ProxyCoT hints at a future where your AI assistant can actually follow a 500-page legal brief without losing the plot.

Key Points

ProxyCoT trains LLMs on short proxy contexts, then transfers chain-of-thought reasoning to full inputs of up to 10M tokens.
Uses reinforcement learning or teacher distillation on proxy contexts, followed by supervised fine-tuning on full contexts.
Outperforms strong baselines (e.g., full-context fine-tuning) with reduced computational overhead and generalizes to out-of-domain tasks.

Why It Matters

Helps LLMs reason more reliably across entire long documents at lower compute cost than full-context fine-tuning, which matters for legal, research, and code analysis.
