Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models
New model mimics eye foveation to cut visual-token costs by 10x
Researchers from Meta and academic institutions have introduced Foveated Reasoner, a novel autoregressive vision-language model that dramatically reduces compute costs by mimicking the human eye's foveation process. Instead of processing entire high-resolution images as a dense token grid—which can balloon visual-token counts and slow inference—the model starts with a low-resolution overview and dynamically triggers selective, high-acuity 'foveation' actions only when reasoning demands finer detail. These actions retrieve evidence from chosen image regions and inject it back into the same decoding trajectory, unifying visual focusing and reasoning in a single pass.
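To make the control flow concrete, here is a minimal sketch of what such an action-based decoding loop could look like. Every name in it (FoveateAction, encode_overview, encode_region, decode_step, the token budget) is a hypothetical stand-in rather than the paper's implementation; the point is only that foveation is an action the model can emit mid-decoding, whose high-resolution result is appended to the same context before decoding continues.

```python
# Hypothetical sketch of action-based foveated decoding; not the paper's code.
from dataclasses import dataclass
from typing import List, Union


@dataclass
class FoveateAction:
    """Request to re-encode one image region at high acuity (assumed format)."""
    x: int
    y: int
    w: int
    h: int


def encode_overview(image, num_tokens: int = 64) -> List[str]:
    # Placeholder: a real model would run a vision encoder on a downsampled view.
    return [f"<ovr_{i}>" for i in range(num_tokens)]


def encode_region(image, act: FoveateAction, num_tokens: int = 32) -> List[str]:
    # Placeholder: crop (x, y, w, h) and encode it at higher resolution.
    return [f"<fov_{i}>" for i in range(num_tokens)]


def decode_step(context: List) -> Union[str, FoveateAction]:
    # Placeholder policy: foveate once, then stop. A real model predicts this.
    if not any(isinstance(t, str) and t.startswith("<fov_") for t in context):
        return FoveateAction(x=100, y=40, w=128, h=128)
    return "<eos>"


def foveated_decode(image, prompt: List[str], token_budget: int = 256) -> List[str]:
    """Single decoding pass that interleaves reasoning and selective foveation."""
    context = list(prompt) + encode_overview(image)  # start from a low-res overview
    visual_tokens_used = 64
    answer: List[str] = []
    while True:
        out = decode_step(context)
        if isinstance(out, FoveateAction) and visual_tokens_used < token_budget:
            # Foveation: fetch high-acuity evidence for the requested region and
            # inject it into the same trajectory, then keep decoding.
            region = encode_region(image, out)
            context += region
            visual_tokens_used += len(region)
        else:
            answer.append(out)
            context.append(out)
            if out == "<eos>":
                return answer


print(foveated_decode(image=None, prompt=["Describe", "the", "sign."]))
```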
The team trained Foveated Reasoner using a two-stage pipeline: cold-start supervision to bootstrap initial foveation behavior, followed by reinforcement learning that jointly optimizes evidence acquisition and task accuracy while penalizing trivial 'see-everything' strategies. Experiments across multiple vision-language benchmarks show the method learns effective foveation policies, reaching higher accuracy than conventional dense-token approaches under tight visual-token budgets. This work could enable more efficient deployment of vision-language models in resource-constrained settings such as mobile devices or real-time systems.
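The 'see-everything' penalty can be pictured as simple reward shaping in the RL stage. The sketch below is an assumption, not the paper's reward function: it only illustrates charging a per-token cost for foveated evidence, so that indiscriminate zooming lowers the return even when the answer is correct.

```python
# Hypothetical reward shaping for the RL stage (illustrative values, not from the paper).
def foveation_reward(correct: bool, visual_tokens_used: int,
                     token_budget: int = 256, cost_per_token: float = 0.002) -> float:
    """Toy reward: +1 for a correct answer, minus a linear visual-token cost,
    with an extra penalty for exceeding the budget."""
    reward = 1.0 if correct else 0.0
    reward -= cost_per_token * visual_tokens_used
    if visual_tokens_used > token_budget:
        reward -= 0.5  # discourage trivial 'see-everything' roll-outs
    return reward


# Example: a correct answer that consumed 96 foveated tokens
print(foveation_reward(correct=True, visual_tokens_used=96))  # 0.808
```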
- Foveated Reasoner starts with low-res views and selectively zooms on key regions only when needed, cutting visual-token compute overhead
- Trained via cold-start supervision followed by reinforcement learning to avoid trivial 'see-everything' shortcuts
- Achieves stronger accuracy under tight token budgets across multiple vision-language benchmarks
Why It Matters
Enables cheaper, faster vision AI for mobile and real-time apps by mimicking human eye foveation