Internalized Reasoning for Long-Context Visual Document Understanding
New method makes AI reason internally on long documents, cutting output tokens by over 12x.
Researcher Austin Veselka has published a novel AI training technique called 'Internalized Reasoning for Long-Context Visual Document Understanding.' The core innovation is a synthetic data pipeline that teaches AI models to perform reasoning internally before generating a final answer. The pipeline works by having a model score document pages for relevance to a question, extract textual evidence from the most relevant pages, and order that evidence. The resulting 'thinking trace' is then used for Supervised Fine-Tuning (SFT), placed inside special <think> tags and gated by a <cot> (chain-of-thought) token. The reasoning capability is then 'internalized' using a low-strength model merging technique.
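The pipeline steps above can be sketched roughly as follows. This is a minimal illustration, not the released code: the function names, the `scorer` and `extractor` callables, the top-k selection, and the simple linear weight interpolation are all assumptions standing in for the actual implementation details.

```python
# Hypothetical sketch of the synthetic reasoning-trace pipeline.
# All names are illustrative, not taken from the released code.

def score_pages(pages, question, scorer):
    # Step 1: score each document page for relevance to the question.
    return {i: scorer(page, question) for i, page in enumerate(pages)}

def build_trace(pages, question, scorer, extractor, top_k=3):
    scores = score_pages(pages, question, scorer)
    # Keep the top-k most relevant pages, ordered by descending score.
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_k]
    # Step 2: extract textual evidence from each selected page.
    evidence = [extractor(pages[i], question) for i in ranked]
    # Step 3: wrap the ordered evidence in a <think> span, gated by
    # the <cot> control token, for use as an SFT training target.
    return "<cot><think>\n" + "\n".join(evidence) + "\n</think>"

def merge_weights(base, tuned, alpha=0.1):
    # Low-strength merge (assumed linear interpolation): blend the
    # fine-tuned weights into the base model with a small alpha to
    # "internalize" the reasoning behaviour.
    return {k: (1 - alpha) * base[k] + alpha * tuned[k] for k in base}
```

For example, with a keyword-match `scorer` and an identity `extractor`, `build_trace` selects only the pages mentioning the question's topic and emits them inside the `<cot><think>…</think>` span.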
The method was tested on two prominent open-source vision-language models: Qwen3 VL 32B and Mistral Small 3.1 24B. The results are striking. The enhanced 32-billion-parameter Qwen3 model scored 58.3 on the MMLongBenchDoc benchmark, edging out a roughly seven-times-larger 235-billion-parameter Qwen3 variant (57.0). With Mistral, the synthetic reasoning approach outperformed a simpler distillation method by 3.8 points. Crucially, internalized reasoning yields dramatically more efficient outputs: models using this method produced an average of 12.4 times fewer output tokens than models that reason explicitly step-by-step in their responses. The entire pipeline has been released publicly for reproducibility.
- A 32B parameter Qwen3 VL model, trained with Internalized Reasoning, outperformed a 7x larger 235B model on the MMLongBenchDoc benchmark (58.3 vs 57.0).
- The technique reduces verbose reasoning, resulting in AI outputs that are 12.4 times more concise on average compared to standard chain-of-thought methods.
- The synthetic training pipeline is open-source, enabling others to apply internalized reasoning to different models and long-context tasks.
Why It Matters
Enables faster, cheaper AI for analyzing lengthy legal, scientific, and business documents without sacrificing accuracy.