How to Train Your Long-Context Visual Document Model
The study trains 24B- and 32B-parameter models to state-of-the-art results on long-document visual QA, with key findings on context-length matching.
Researcher Austin Veselka presents the first large-scale study of training long-context vision-language models at context lengths up to 344K tokens. The work systematically evaluates continued pretraining, supervised finetuning, and preference optimization for 24B- and 32B-parameter models, achieving state-of-the-art performance on the MMLongBench-Doc benchmark. Key findings show that training with context lengths matched to evaluation, and with page indices included, boosts performance, and that visual long-context training transfers to text-only tasks, improving them as well.
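To make the "page indices" finding concrete, here is a minimal hypothetical sketch of one way such training data could be formatted: each page's tokens are prefixed with a page-index marker, and the packed sequence is capped at the training context length. The function name, marker format, and token representation are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch (not the paper's actual pipeline): pack a multi-page
# document into a single training sequence, tagging each page with its index
# and truncating at the target context length.

def pack_document(pages, context_length):
    """Concatenate per-page token lists, prefixing each page with a
    page-index marker token, and truncate to `context_length` tokens."""
    sequence = []
    for idx, page_tokens in enumerate(pages, start=1):
        sequence.append(f"<page_{idx}>")  # page-index marker (assumed format)
        sequence.extend(page_tokens)
        if len(sequence) >= context_length:
            break  # document already fills the context window
    return sequence[:context_length]

# Toy usage: 4 pages of 5 placeholder tokens each, 16-token context window.
pages = [["tok"] * 5 for _ in range(4)]
seq = pack_document(pages, context_length=16)
```

The intuition behind matching the training context length to the evaluation context length is that the model then sees position ranges and page counts at train time that resemble those at test time.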
Why It Matters
Provides a reproducible blueprint for building AI that can understand and answer questions about massive visual documents like lengthy reports or books.