Research & Papers

Granulon: Awakening Pixel-Level Visual Encoders with Adaptive Multi-Granularity Semantics for MLLM

New AI model dynamically adjusts visual detail from pixel-level to broad concepts based on text prompts.

Deep Dive

A research team led by Junyuan Mao has introduced Granulon, a new multimodal large language model (MLLM) designed to solve a fundamental limitation in current vision-language AI. Most MLLMs like GPT-4V rely on CLIP-based visual encoders, which excel at global semantic alignment (e.g., recognizing a "dog") but struggle with fine-grained details (e.g., identifying the dog's breed or a specific spot on its fur). In contrast, models like Meta's DINOv3 offer strong pixel-level perception but lack the ability to form high-level semantic concepts. Granulon bridges this gap by building on DINOv2/DINOv3 and making its visual understanding adaptive.

Granulon's core innovation is a two-part architecture that lets it reason across multiple levels of visual detail dynamically. First, a text-conditioned 'Granularity Controller' analyzes the user's query to determine the required level of visual abstraction, from pixel-perfect detail to a broader conceptual understanding. Second, an 'Adaptive Token Aggregation' module processes the visual input accordingly, using granularity-guided pooling and clustering to produce a compact set of semantically rich visual tokens for the LLM. This enables what the authors call unified "pixel-to-fine-to-coarse" reasoning in a single forward pass.
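
To make the two-stage idea concrete, here is a minimal PyTorch sketch of how a text-conditioned controller and a granularity-guided aggregator could fit together. It is not the authors' implementation: the class names (GranularityController, AdaptiveTokenAggregator), dimensions, token budgets, and the adaptive-average-pooling stand-in for the paper's pooling and clustering step are all illustrative assumptions.

```python
# Illustrative sketch only; module names, sizes, and the pooling strategy are
# assumptions based on the description above, not Granulon's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GranularityController(nn.Module):
    """Maps a pooled text-query embedding to a soft granularity level in [0, 1]."""
    def __init__(self, text_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        # text_emb: (batch, text_dim) -> granularity score per query, shape (batch,)
        return torch.sigmoid(self.mlp(text_emb)).squeeze(-1)

class AdaptiveTokenAggregator(nn.Module):
    """Pools dense visual tokens into a compact set whose size depends on granularity."""
    def __init__(self, vis_dim: int, min_tokens: int = 16, max_tokens: int = 256):
        super().__init__()
        self.min_tokens = min_tokens
        self.max_tokens = max_tokens
        self.proj = nn.Linear(vis_dim, vis_dim)

    def forward(self, vis_tokens: torch.Tensor, granularity: torch.Tensor) -> torch.Tensor:
        # vis_tokens: (batch, n_patches, vis_dim) from a DINO-style encoder.
        # Higher granularity keeps more tokens (fine detail); lower pools more aggressively.
        batch, n, d = vis_tokens.shape
        g = granularity.mean().item()  # batch-mean so the output length is uniform
        k = int(self.min_tokens + g * (self.max_tokens - self.min_tokens))
        k = max(self.min_tokens, min(k, n))
        # Adaptive average pooling over the token axis stands in for the paper's
        # granularity-guided pooling/clustering step.
        pooled = F.adaptive_avg_pool1d(vis_tokens.transpose(1, 2), k).transpose(1, 2)
        return self.proj(pooled)  # (batch, k, vis_dim) tokens handed to the LLM

# Example: a detail-seeking query keeps more visual tokens than a coarse one.
controller = GranularityController(text_dim=512)
aggregator = AdaptiveTokenAggregator(vis_dim=768)
text_emb = torch.randn(1, 512)          # pooled embedding of the user's query
vis_tokens = torch.randn(1, 1024, 768)  # dense patch tokens from the visual encoder
g = controller(text_emb)
llm_visual_tokens = aggregator(vis_tokens, g)
print(llm_visual_tokens.shape)  # e.g. torch.Size([1, k, 768])
```

In this sketch the controller's output simply scales how many visual tokens survive aggregation; the actual Granulon modules presumably condition the pooling and clustering itself on the query, but the overall data flow (query in, compact visual token set out) follows the description above.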

The results are significant. In extensive testing, Granulon demonstrated a ~30% improvement in accuracy on tasks requiring detailed visual understanding and a ~20% reduction in visual hallucinations compared to models using CLIP or other visual encoders under identical settings. This performance leap suggests that the future of multimodal AI may not lie in a single, static visual encoder, but in systems that can fluidly adapt their 'visual focus' based on the conversation, much like a human would.

Key Points
  • Uses Meta's DINOv3 as a base for pixel-level perception, enhanced with new adaptive modules.
  • Dynamically adjusts visual processing granularity based on the text query via the Granularity Controller and Adaptive Token Aggregation modules.
  • Achieves ~30% higher accuracy and ~20% fewer hallucinations than CLIP-based MLLMs in comparable tests.

Why It Matters

Enables AI to see and reason with appropriate detail, reducing errors in medical imaging, quality control, and detailed visual Q&A.