Deep Pre-Alignment boosts VLMs by 3 points at 32B scale
New architecture replaces ViT with a small VLM to align visual and text features upfront.
Most Vision-Language Models (VLMs) use a lightweight projector to map ViT outputs into the LLM's input space, but the projected visual features remain far from the text space in the early LLM layers, so the model spends depth on alignment rather than reasoning. The new Deep Pre-Alignment (DPA) architecture replaces the ViT encoder entirely with a small VLM that acts as a perceiver, so visual features are already aligned with the target LLM's text space before they reach the main model. This offloads the alignment burden and lets the main LLM focus on understanding and complex reasoning.
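The structural change can be pictured as swapping one component while the main LLM stays put. The following is a toy sketch in plain Python, not the authors' implementation: the names `vit_plus_projector`, `small_vlm_perceiver`, `main_llm`, `VLM`, and `LLM_DIM` are hypothetical, the "models" are simple list transformations, and it assumes the perceiver emits vectors at the main LLM's width (the summary does not say whether a projector still follows it).

```python
from dataclasses import dataclass, replace
from typing import Callable, List

Tokens = List[List[float]]  # a sequence of embedding vectors

LLM_DIM = 4  # hypothetical embedding width of the main LLM


def vit_plus_projector(patches: Tokens) -> Tokens:
    """Baseline front end: toy 'ViT' features, then a lightweight projector.

    The projector only resizes vectors to LLM_DIM (zero-pad or truncate),
    standing in for a shallow mapping that leaves alignment to the LLM.
    """
    vit_features = [[2.0 * x for x in p] for p in patches]
    return [(v + [0.0] * LLM_DIM)[:LLM_DIM] for v in vit_features]


def small_vlm_perceiver(patches: Tokens) -> Tokens:
    """DPA-style front end: a small VLM takes the place of the ViT.

    Modeled as an opaque function that emits LLM_DIM-wide vectors; in a
    real system these would come from the small VLM itself (assumption).
    """
    return [[sum(p) / len(p)] * LLM_DIM for p in patches]


def main_llm(sequence: Tokens) -> int:
    """Toy main LLM: checks each token is LLM_DIM wide, returns the length."""
    assert all(len(tok) == LLM_DIM for tok in sequence)
    return len(sequence)


@dataclass(frozen=True)
class VLM:
    visual_front_end: Callable[[Tokens], Tokens]
    llm: Callable[[Tokens], int]

    def forward(self, patches: Tokens, text: Tokens) -> int:
        # Visual tokens are placed ahead of the text tokens, then the LLM runs.
        return self.llm(self.visual_front_end(patches) + text)


baseline = VLM(visual_front_end=vit_plus_projector, llm=main_llm)
# Swap only the visual front end; the main LLM object is reused as-is.
dpa = replace(baseline, visual_front_end=small_vlm_perceiver)
```

Calling `baseline.forward(patches, text)` and `dpa.forward(patches, text)` exercises the same main LLM with two different visual front ends, which is the only point the sketch is meant to make.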
Results show DPA outperforms baselines by 1.9 points on 8 multimodal benchmarks at the 4B scale, and the gap widens to 3.0 points at 32B. DPA also reduces forgetting of language capabilities by 32.9% across 3 text benchmarks. These gains hold across different LLM families (Qwen3, LLaMA 3.2), and because the change amounts to swapping the visual encoder, the architecture offers a straightforward upgrade path with minimal computational overhead. The work was accepted at ICML 2026.
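The summary does not spell out how these figures are computed. One plausible reading, sketched below, is that the point gain is a mean score difference across benchmarks and that the forgetting figure is a relative reduction in the score drop from the original text-only LLM. The function names and every number in the usage note are placeholders for illustration, not results from the paper.

```python
from typing import Sequence


def mean_gain(dpa_scores: Sequence[float], baseline_scores: Sequence[float]) -> float:
    """Average per-benchmark improvement in points (assumed definition)."""
    assert len(dpa_scores) == len(baseline_scores) and len(dpa_scores) > 0
    diffs = [d - b for d, b in zip(dpa_scores, baseline_scores)]
    return sum(diffs) / len(diffs)


def forgetting_reduction_pct(
    original_llm: float, baseline_vlm: float, dpa_vlm: float
) -> float:
    """Relative reduction in the text-score drop, in percent (assumed definition).

    'Forgetting' is taken to be how far a VLM's text score falls below the
    score of the original text-only LLM it was built from.
    """
    baseline_drop = original_llm - baseline_vlm
    dpa_drop = original_llm - dpa_vlm
    assert baseline_drop > 0, "baseline must show some forgetting"
    return 100.0 * (baseline_drop - dpa_drop) / baseline_drop
```

For example, with placeholder scores, `forgetting_reduction_pct(70.0, 65.0, 66.0)` compares a 5-point drop against a 4-point drop, and `mean_gain([61.0, 72.0], [60.0, 69.0])` averages gains of 1 and 3 points.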
- DPA replaces ViT with a small VLM perceiver for deep visual-text alignment before reasoning.
- Outperforms baselines by 1.9 points (4B scale) and 3.0 points (32B scale) on multimodal benchmarks.
- Reduces language forgetting by 32.9% and works across Qwen3 and LLaMA 3.2 families.
Why It Matters
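By moving visual-text alignment out of the main LLM, DPA yields VLMs that score higher on multimodal benchmarks and forget less of their language ability, and the change can be adopted as a modular upgrade by swapping the visual encoder.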