Research & Papers

How to Train Your Long-Context Visual Document Model

The study trains 24B- and 32B-parameter models to state-of-the-art performance on long-document visual QA, with key findings on context-length matching.

Deep Dive

Researcher Austin Veselka presents the first large-scale study of training long-context vision-language models at context lengths up to 344K tokens. The work systematically tests continued pretraining, supervised finetuning, and preference optimization for 24B- and 32B-parameter models, achieving state-of-the-art performance on the MMLongBench-Doc benchmark. Key findings show that training with matched context lengths and explicit page indices boosts performance, and that visual long-context training also improves performance on text-only tasks.
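To make the context-length and page-index findings concrete, here is a minimal sketch of how a training sample might interleave explicit page markers with per-page visual tokens and cap the sequence at the training context length. All names here (`Page`, `build_training_sample`, the `<page N>` marker format) are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Page:
    index: int           # 1-based page number within the document
    image_tokens: list   # placeholder tokens standing in for the encoded page image

def build_training_sample(pages, question, max_context=344_000):
    """Interleave explicit page-index markers with each page's visual tokens,
    append the question, and truncate to the target training context length.
    The '<page N>' marker format is a hypothetical stand-in."""
    tokens = []
    for page in pages:
        tokens.append(f"<page {page.index}>")  # explicit page index, per the paper's finding
        tokens.extend(page.image_tokens)
    tokens.append(question)
    return tokens[:max_context]

# Example: a 3-page document with dummy visual tokens
doc = [Page(i, [f"img_{i}_{j}" for j in range(4)]) for i in range(1, 4)]
print(build_training_sample(doc, "Q: What appears on page 2?")[:8])
```

One plausible reading of the findings this illustrates: the model sees page boundaries explicitly during training, so answers can be grounded to specific pages, and the training sequence length is chosen to match the long documents the model will face at inference.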

Why It Matters

Provides a reproducible training blueprint for AI systems that can understand and answer questions about very long visual documents such as lengthy reports or books.