Deep Pre-Alignment boosts VLMs by 3 points at 32B scale

arXiv cs.CV May 18, 2026

⚡ New architecture replaces the ViT encoder with a small VLM to align visual and text features upfront.

Deep Dive

Most Vision-Language Models (VLMs) use a lightweight projector to map ViT outputs into the LLM's input space. The resulting visual features still sit far from the text space in the early LLM layers, so the model spends depth on alignment rather than reasoning. The new Deep Pre-Alignment (DPA) architecture replaces the ViT encoder entirely with a small VLM that acts as a perceiver, aligning visual features deeply with the target LLM's text space before the main LLM processes them. This offloads the alignment burden, leaving the main LLM free to focus on understanding and complex reasoning.
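To make the difference in data flow concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: every module is a random linear map standing in for a trained network, and all names and widths (`PATCH_DIM`, `SMALL_VLM_DIM`, `make_linear`, `compose`) are hypothetical. It also assumes that the small VLM consists of its own vision encoder plus a few language layers, and that a projector still connects it to the main LLM; the article does not confirm either detail.

```python
import random
from typing import Callable, List

Tokens = List[List[float]]           # one feature vector per image patch
Stage = Callable[[Tokens], Tokens]

# Toy widths, all hypothetical and far smaller than any real model.
PATCH_DIM, VIT_DIM, SMALL_VLM_DIM, MAIN_LLM_DIM = 12, 8, 6, 10
NUM_PATCHES = 4


def make_linear(d_in: int, d_out: int, seed: int) -> Stage:
    """Return a random linear map that stands in for a trained module."""
    rng = random.Random(seed)
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(d_in)] for _ in range(d_out)]

    def apply(tokens: Tokens) -> Tokens:
        return [[sum(w * x for w, x in zip(row, tok)) for row in weight]
                for tok in tokens]

    return apply


def compose(*stages: Stage) -> Stage:
    """Chain stages left to right into a single pipeline."""
    def run(tokens: Tokens) -> Tokens:
        for stage in stages:
            tokens = stage(tokens)
        return tokens

    return run


# Conventional pipeline: ViT encoder, then a lightweight projector.
conventional = compose(
    make_linear(PATCH_DIM, VIT_DIM, seed=0),        # ViT encoder stand-in
    make_linear(VIT_DIM, MAIN_LLM_DIM, seed=1),     # lightweight projector
)

# DPA-style pipeline (assumed layout): a small VLM acts as the perceiver,
# so visual tokens pass through its language layers before the main LLM.
small_vlm_perceiver = compose(
    make_linear(PATCH_DIM, VIT_DIM, seed=2),              # small VLM vision encoder
    make_linear(VIT_DIM, SMALL_VLM_DIM, seed=3),          # small VLM internal projector
    make_linear(SMALL_VLM_DIM, SMALL_VLM_DIM, seed=4),    # small VLM language layer 1
    make_linear(SMALL_VLM_DIM, SMALL_VLM_DIM, seed=5),    # small VLM language layer 2
)
dpa = compose(
    small_vlm_perceiver,
    make_linear(SMALL_VLM_DIM, MAIN_LLM_DIM, seed=6),     # projector into main LLM
)

# Random "image patches" as input to both pipelines.
rng = random.Random(42)
patches = [[rng.random() for _ in range(PATCH_DIM)] for _ in range(NUM_PATCHES)]

conventional_out = conventional(patches)
dpa_out = dpa(patches)
```

Both pipelines hand the main LLM the same number of tokens at the same width, which is why the change can be framed as an encoder swap. What differs is how much text-space processing the visual tokens have already received before they arrive.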

Results show DPA outperforms baselines by 1.9 points across 8 multimodal benchmarks at the 4B scale, and the gap widens to 3.0 points at 32B. DPA also reduces forgetting of language capabilities by 32.9% across 3 text benchmarks. The gains hold across different LLM families (Qwen3, LLaMA 3.2), and the architecture offers a seamless upgrade path: only the visual encoder is swapped, with minimal computational overhead. The work was accepted at ICML 2026.
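The article does not spell out how these figures are computed. One plausible reading, sketched below, is that the point gain is a mean score difference across benchmarks and that the forgetting reduction is the relative shrinkage of the average score drop from the text-only base LLM. Both definitions are assumptions, and every score in the example is made up purely to show the arithmetic; none comes from the paper.

```python
from typing import Sequence


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def mean_point_gain(baseline: Sequence[float], model: Sequence[float]) -> float:
    """Average per-benchmark score difference, in points (assumed definition)."""
    return mean([m - b for b, m in zip(baseline, model)])


def forgetting_reduction_pct(
    base_llm: Sequence[float],
    baseline_vlm: Sequence[float],
    new_vlm: Sequence[float],
) -> float:
    """Relative reduction in the average text-score drop (assumed definition)."""
    baseline_drop = mean([b - v for b, v in zip(base_llm, baseline_vlm)])
    new_drop = mean([b - v for b, v in zip(base_llm, new_vlm)])
    return 100.0 * (baseline_drop - new_drop) / baseline_drop


# Hypothetical text-benchmark scores, NOT results from the paper.
base_llm_scores = [70.0, 60.0, 50.0]      # text-only LLM before multimodal training
baseline_vlm_scores = [66.0, 56.0, 46.0]  # baseline VLM: average drop of 4 points
new_vlm_scores = [67.0, 57.0, 47.0]       # pre-aligned VLM: average drop of 3 points

reduction = forgetting_reduction_pct(
    base_llm_scores, baseline_vlm_scores, new_vlm_scores
)
gain = mean_point_gain(baseline_vlm_scores, new_vlm_scores)
```

With these made-up inputs the average drop shrinks from 4 points to 3, a 25% reduction under this definition, while the mean gain over the baseline is 1 point. A relative figure like 32.9% therefore says nothing on its own about the absolute size of the drop.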

Key Points

DPA replaces ViT with a small VLM perceiver for deep visual-text alignment before reasoning.
Outperforms baselines by 1.9 points (4B scale) and 3.0 points (32B scale) on multimodal benchmarks.
Reduces language forgetting by 32.9% and works across Qwen3 and LLaMA 3.2 families.

Why It Matters

DPA enables more efficient, less forgetful VLMs, unlocking better multimodal reasoning with a simple modular upgrade.
