Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models
New model mimics eye foveation to cut visual-token costs by 10x
Researchers from Meta and academic institutions have introduced Foveated Reasoner, a novel autoregressive vision-language model that dramatically reduces compute costs by mimicking the human eye's foveation process. Instead of processing entire high-resolution images as a dense token grid—which can balloon visual-token counts and slow inference—the model starts with a low-resolution overview and dynamically triggers selective, high-acuity 'foveation' actions only when reasoning demands finer detail. These actions retrieve evidence from chosen image regions and inject it back into the same decoding trajectory, unifying visual focusing and reasoning in a single pass.
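To make the control flow concrete, here is a minimal sketch of what such an action-based decoding loop could look like. Every name in it (FoveateAction, encode_overview, encode_region, decode_step, the token budget) is a hypothetical stand-in rather than the paper's implementation; the point is only that foveation is an action the model can emit mid-decoding, whose high-resolution result is appended to the same context before decoding continues.

```python
# Hypothetical sketch of action-based foveated decoding; not the paper's code.
from dataclasses import dataclass
from typing import List, Union


@dataclass
class FoveateAction:
    """Request to re-encode one image region at high acuity (assumed format)."""
    x: int
    y: int
    w: int
    h: int


def encode_overview(image, num_tokens: int = 64) -> List[str]:
    # Placeholder: a real model would run a vision encoder on a downsampled view.
    return [f"<ovr_{i}>" for i in range(num_tokens)]


def encode_region(image, act: FoveateAction, num_tokens: int = 32) -> List[str]:
    # Placeholder: crop (x, y, w, h) and encode it at higher resolution.
    return [f"<fov_{i}>" for i in range(num_tokens)]


def decode_step(context: List) -> Union[str, FoveateAction]:
    # Placeholder policy: foveate once, then stop. A real model predicts this.
    if not any(isinstance(t, str) and t.startswith("<fov_") for t in context):
        return FoveateAction(x=100, y=40, w=128, h=128)
    return "<eos>"


def foveated_decode(image, prompt: List[str], token_budget: int = 256) -> List[str]:
    """Single decoding pass that interleaves reasoning and selective foveation."""
    context = list(prompt) + encode_overview(image)  # start from a low-res overview
    visual_tokens_used = 64
    answer: List[str] = []
    while True:
        out = decode_step(context)
        if isinstance(out, FoveateAction) and visual_tokens_used < token_budget:
            # Foveation: fetch high-acuity evidence for the requested region and
            # inject it into the same trajectory, then keep decoding.
            region = encode_region(image, out)
            context += region
            visual_tokens_used += len(region)
        else:
            answer.append(out)
            context.append(out)
            if out == "<eos>":
                return answer


print(foveated_decode(image=None, prompt=["Describe", "the", "sign."]))
```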
The team trained Foveated Reasoner using a two-stage pipeline: cold-start supervision to bootstrap initial foveation behavior, followed by reinforcement learning that jointly optimizes evidence acquisition and task accuracy while penalizing trivial 'see-everything' strategies. Experiments across multiple vision-language benchmarks show the method learns effective foveation policies, reaching higher accuracy than conventional dense-token approaches under tight visual-token budgets. This work could enable more efficient deployment of vision-language models in resource-constrained settings such as mobile devices or real-time systems.
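The 'see-everything' penalty can be pictured as simple reward shaping in the RL stage. The sketch below is an assumption, not the paper's reward function: it only illustrates charging a per-token cost for foveated evidence, so that indiscriminate zooming lowers the return even when the answer is correct.

```python
# Hypothetical reward shaping for the RL stage (illustrative values, not from the paper).
def foveation_reward(correct: bool, visual_tokens_used: int,
                     token_budget: int = 256, cost_per_token: float = 0.002) -> float:
    """Toy reward: +1 for a correct answer, minus a linear visual-token cost,
    with an extra penalty for exceeding the budget."""
    reward = 1.0 if correct else 0.0
    reward -= cost_per_token * visual_tokens_used
    if visual_tokens_used > token_budget:
        reward -= 0.5  # discourage trivial 'see-everything' roll-outs
    return reward


# Example: a correct answer that consumed 96 foveated tokens
print(foveation_reward(correct=True, visual_tokens_used=96))  # 0.808
```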
- Foveated Reasoner starts with low-res views and selectively zooms on key regions only when needed, cutting visual-token compute overhead
- Trained via cold-start supervision followed by reinforcement learning to avoid trivial 'see-everything' shortcuts
- Achieves stronger accuracy under tight token budgets across multiple vision-language benchmarks
Why It Matters
Enables cheaper, faster vision AI for mobile and real-time apps by mimicking human eye foveation