GAP3D bridges VLM and image encoders for 3D generation without 3D data
New diffusion-based alignment unlocks zero-shot 3D assets from text prompts.
GAP3D, introduced by Polytimi Anna Gkotsi, Andrii Zadaianchuk, and Mohammad Mahdi Derakhshani, tackles a key challenge in generative 3D modeling: how to use powerful vision-language models (VLMs) as prompt encoders without costly end-to-end retraining. Existing methods often compress VLM features into low-dimensional representations, losing the dense spatial information that geometry-aware tasks need. GAP3D instead uses a diffusion process to align VLM-generated latents directly to the full, patch-level feature space of a pre-trained image encoder. A frozen downstream generative model designed for 3D asset creation thus receives a spatially structured conditioning signal while still benefiting from the VLM's rich semantic understanding. Because GAP3D trains primarily on general-domain image-text pairs, it bypasses the need for large-scale 3D datasets. The method also exhibits emergent zero-shot capabilities for multimodal prompts (e.g., combining text and images), even though it was trained only on text inputs. The authors note that GAP3D currently prioritizes high-level semantics over fine-grained geometric detail; even so, the work demonstrates that the representation gap between VLM and image-encoder feature spaces can be partially bridged through diffusion-based alignment. This opens a modular path to integrating foundation models into 3D generation without architectural overhauls.
- Avoids expensive end-to-end training by aligning VLM latents to a frozen image encoder's patch-level features via diffusion.
- Trains primarily on general-domain image-text pairs, avoiding the need for large-scale 3D training data.
- Shows emergent zero-shot multimodal prompting (text + image) despite being trained only on text inputs.
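To make the alignment idea concrete, here is a minimal NumPy sketch of one diffusion training step that maps a conditioning vector to patch-level target features. It is an illustration under stated assumptions, not the authors' implementation: the cosine noise schedule, the noise-prediction loss, the linear `toy_denoiser`, and all names and shapes are hypothetical choices, and random arrays stand in for the VLM latent and the frozen image encoder's patch features.

```python
import numpy as np

def cosine_alpha_bar(t, T):
    """Fraction of clean signal kept at step t (cosine schedule, an assumed choice)."""
    return np.cos(0.5 * np.pi * t / T) ** 2

def add_noise(x0, t, T, rng):
    """Forward diffusion: blend clean patch features with Gaussian noise."""
    a = cosine_alpha_bar(t, T)
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return x_t, eps

def toy_denoiser(x_t, cond, t, T, W_x, W_c):
    """Hypothetical stand-in for the alignment network: predicts the noise
    from noisy patch features, a conditioning vector, and the timestep."""
    # cond @ W_c has shape (D,) and is broadcast across all P patches.
    return x_t @ W_x + cond @ W_c + t / T

def alignment_loss(x0, cond, t, T, W_x, W_c, rng):
    """Mean squared error between predicted and true noise (assumed objective)."""
    x_t, eps = add_noise(x0, t, T, rng)
    eps_hat = toy_denoiser(x_t, cond, t, T, W_x, W_c)
    return float(np.mean((eps_hat - eps) ** 2))

rng = np.random.default_rng(0)
P, D, C, T = 16, 8, 4, 1000               # patches, feature dim, cond dim, diffusion steps
x0 = rng.standard_normal((P, D))          # stand-in for frozen image-encoder patch features
cond = rng.standard_normal(C)             # stand-in for a VLM-generated latent
W_x = 0.1 * rng.standard_normal((D, D))   # toy denoiser weights
W_c = 0.1 * rng.standard_normal((C, D))
loss = alignment_loss(x0, cond, 500, T, W_x, W_c, rng)
print(f"toy alignment loss at t=500: {loss:.4f}")
```

The point of the sketch is that the target keeps its full patch dimension `P` rather than being pooled into a single vector, which is the spatial structure the summary says compressed representations lose. In a real system the denoiser would be a trained network and the sampled features would be passed to the frozen 3D generator.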
Why It Matters
Lowers the barrier to 3D asset creation by enabling text-driven generation without training on specialized 3D datasets, though the method currently favors high-level semantics over fine geometric detail.