AI Safety

Preliminary Explorations on Latent Side Task Uplift

New research shows Claude Opus 4.5's latent reasoning improves by 20% with 240 lines of irrelevant output.

Deep Dive

Researchers from the AI Control MATS Program, led by Bruce W. Lee, published "Preliminary Explorations on Latent Side Task Uplift," investigating whether large language models can solve hidden tasks better when given more computational runway. The team adapted existing filler token experiments into a controlled setup with a main task (repeating phrases) and a side task (solving arithmetic problems). They tested four models—Claude Opus 4.5, Claude Sonnet 4, GPT-5.2, and GPT-4.1—on a dataset of 1,000 multi-step arithmetic problems ranging from 1 to 10 operations.

Claude Opus 4.5 showed the most significant effect: when generating 240 lines of irrelevant text like "the quick brown fox" before answering, its 50% accuracy threshold shifted from ~5.2-step problems to ~6-step problems, a roughly 20% increase in the problem difficulty it could handle. This suggests the model can leverage the additional parallel computation that comes from attending to more positions in its own output. The other models showed minimal or no improvement, with the GPT models displaying no clear trend. The researchers also tested whether generating versus prefilling the prose made a difference and found no meaningful variation, indicating the effect is not about steganographic encoding.
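A 50% accuracy threshold like the one above is typically read off a fitted accuracy-versus-difficulty curve. Here is a minimal sketch of one way to estimate it, fitting a logistic curve by grid search; the function name, parameter grids, and the accuracy numbers are illustrative assumptions, not the paper's data or method.

```python
import math

def fit_50pct_threshold(steps, accuracy):
    """Fit acc(n) = 1 / (1 + exp(k * (n - t))) by grid search and return t,
    the problem length at which predicted accuracy crosses 50%.

    A minimal sketch of one way such a threshold could be estimated; the
    authors' actual procedure may differ.
    """
    best_err, best_t = float("inf"), None
    for t10 in range(10, 101):        # candidate thresholds t in [1.0, 10.0]
        t = t10 / 10
        for k10 in range(1, 51):      # candidate slopes k in [0.1, 5.0]
            k = k10 / 10
            err = sum(
                (a - 1 / (1 + math.exp(k * (n - t)))) ** 2
                for n, a in zip(steps, accuracy)
            )
            if err < best_err:
                best_err, best_t = err, t
    return best_t

# Made-up accuracies that fall off with problem length, for illustration
steps = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
acc = [0.99, 0.97, 0.92, 0.80, 0.55, 0.35, 0.18, 0.08, 0.03, 0.01]
threshold = fit_50pct_threshold(steps, acc)  # crossing point between 5 and 6
```

Comparing the fitted threshold with and without the 240 lines of filler is what yields the ~5.2 to ~6 step shift the researchers report.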

The experiments used strict controls to suppress explicit reasoning: API parameters disabled scratchpad thinking, prompts explicitly forbade explanations, and responses were filtered for any reasoning attempts. This makes the findings particularly interesting: they reveal latent capabilities that operate without any visible chain-of-thought. The code has been made available on GitHub, allowing other researchers to build on these explorations of how LLMs utilize computational resources during generation.
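The response-filtering step described above might look something like the following. This is a hypothetical sketch in the spirit of the paper's controls; the pattern and function name are assumptions, and the authors' actual filter may be stricter or structured differently.

```python
import re

# Reject any filler region that leaks visible reasoning: digits, arithmetic
# symbols, or telltale reasoning words. Hypothetical pattern, not the
# authors' actual filter.
REASONING_PATTERN = re.compile(r"[0-9=+\-*/]|step|therefore|because", re.IGNORECASE)

def is_clean_filler(filler_text: str) -> bool:
    """Return True if the filler text contains no visible reasoning."""
    return REASONING_PATTERN.search(filler_text) is None

assert is_clean_filler("the quick brown fox " * 240)
assert not is_clean_filler("the quick brown fox 7 + 3 = 10")
```

Filtering like this matters because any accuracy gain traceable to smuggled arithmetic in the "irrelevant" output would undermine the claim that the uplift comes from latent computation alone.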

Key Points
  • Claude Opus 4.5 showed a roughly 20% improvement in latent arithmetic, solving ~6-step problems vs. a ~5.2-step baseline after 240 lines of filler text
  • Other tested models (GPT-5.2, GPT-4.1, Claude Sonnet 4) showed minimal or no gains from longer trajectories
  • Researchers used strict reasoning suppression via API parameters and prompt engineering to isolate latent capabilities

Why It Matters

Reveals how LLMs utilize parallel computation during generation, with implications for AI safety and understanding emergent capabilities.