microsoft/Phi-4-reasoning-vision-15B · Hugging Face
A 15B-parameter open-weight model that uses a 'mid-fusion' architecture and a dynamic vision encoder of up to 3,600 visual tokens for high-resolution image understanding.
Microsoft Research has unveiled Phi-4-Reasoning-Vision-15B, a new open-weight multimodal model designed for advanced visual reasoning. Built on the Phi-4-Reasoning language backbone and a SigLIP-2 vision encoder, it employs a 'mid-fusion' architecture where visual tokens are projected into the language model's embedding space. This approach aims to leverage the strengths of both pre-trained components while keeping computational costs manageable. The model is trained with Supervised Fine-Tuning (SFT) on a carefully curated mix of reasoning and non-reasoning data, drawing from filtered open-source datasets and high-quality internal Microsoft data.
The technical core of Phi-4-Reasoning-Vision-15B is its dynamic-resolution vision encoder, which can produce up to 3,600 visual tokens, enabling the high-resolution image understanding critical for GUI grounding and fine-grained document analysis. A key innovation is its flexible reasoning mode: it can invoke extended chain-of-thought reasoning within <think>...</think> blocks for complex tasks such as mathematics, or default to direct inference (tagged with <nothink>) for perception-focused tasks such as object detection. This single-system design avoids the need for separate specialized models. Notably, it was trained with moderate compute (240 NVIDIA B200 GPUs for just four days), highlighting a data-centric approach that contrasts with models requiring massive training resources.
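To make the 3,600-token ceiling concrete, here is a minimal sketch of a dynamic-resolution token budget. Only the 3,600-token limit comes from the model description; the 14-pixel patch size and the simple grid-tiling scheme are assumptions for illustration (SigLIP-style encoders commonly use 14x14 patches, but the model's actual tiling strategy is not specified).

```python
MAX_VISUAL_TOKENS = 3600  # stated limit of the dynamic vision encoder
PATCH_SIZE = 14           # assumed patch size, for illustration only

def visual_token_count(width: int, height: int) -> int:
    """Tokens needed to cover an image at native resolution, capped at the budget."""
    cols = -(-width // PATCH_SIZE)   # ceiling division over image width
    rows = -(-height // PATCH_SIZE)  # ceiling division over image height
    return min(cols * rows, MAX_VISUAL_TOKENS)

def max_square_side() -> int:
    """Largest square image (in pixels) that exactly fills the token budget."""
    side_patches = int(MAX_VISUAL_TOKENS ** 0.5)  # 60 patches per side
    return side_patches * PATCH_SIZE

print(visual_token_count(840, 840))    # 3600: exactly fills the budget
print(visual_token_count(1920, 1080))  # larger images are capped at 3600
print(max_square_side())               # 840
```

Under these assumptions, an 840x840 image saturates the budget, which is why a larger token ceiling directly translates into finer-grained coverage of dense inputs like documents and GUI screenshots.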
- Uses a 'mid-fusion' architecture combining Phi-4-Reasoning LLM and SigLIP-2 vision encoder for efficient multimodal processing.
- Features a dynamic-resolution vision encoder handling up to 3,600 visual tokens for high-resolution tasks such as GUI grounding and document analysis.
- Operates as a single system that toggles between chain-of-thought reasoning (<think>) and direct inference (<nothink>) modes.
Why It Matters
It provides a powerful, efficient open-weight alternative for high-resolution visual reasoning tasks, reducing the compute barrier for advanced multimodal AI.