Research & Papers

Scaling View Synthesis Transformers

New encoder-decoder architecture achieves state-of-the-art results with substantially reduced training compute.

Deep Dive

A research team including Evan Kim and Vincent Sitzmann has published a study on scaling laws for view synthesis transformers, introducing the Scalable View Synthesis Model (SVSM). The work, submitted to CVPR 2026, systematically investigates how geometry-free transformers (models that map source images directly to target views without building an explicit 3D representation) scale with compute on Novel View Synthesis (NVS), the task of generating new perspectives of a 3D scene from a limited set of 2D images. The key finding overturns conventional wisdom in the field: encoder-decoder architectures, previously thought to be suboptimal for NVS, can in fact be compute-optimal when designed correctly. The researchers attribute earlier negative results to flawed architectural comparisons and unequal training budgets.
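
The paper's actual architecture is not reproduced in this summary, but the geometry-free encoder-decoder pattern it builds on can be sketched in a few lines. The following PyTorch snippet is purely illustrative: the class name, the pose-encoding scheme, and every hyperparameter are placeholders, not details from SVSM. It only shows the general shape of the approach, where an encoder embeds source-view tokens and a decoder cross-attends to them under a target-camera condition.

```python
import torch
import torch.nn as nn

class EncoderDecoderNVS(nn.Module):
    """Illustrative geometry-free encoder-decoder for novel view synthesis.

    The encoder embeds patch tokens from the source views together with
    their camera-pose encodings; the decoder cross-attends to that memory,
    conditioned on the target camera pose, and regresses target-view
    patches. All sizes are placeholders, not the paper's.
    """

    def __init__(self, patch_dim=768, pose_dim=16, d_model=512,
                 n_heads=8, n_layers=6, num_target_tokens=256):
        super().__init__()
        self.src_embed = nn.Linear(patch_dim + pose_dim, d_model)
        self.tgt_queries = nn.Parameter(torch.randn(num_target_tokens, d_model))
        self.tgt_pose_embed = nn.Linear(pose_dim, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=n_heads,
            num_encoder_layers=n_layers, num_decoder_layers=n_layers,
            batch_first=True,
        )
        self.to_patch = nn.Linear(d_model, patch_dim)

    def forward(self, src_patches, src_poses, tgt_pose):
        # src_patches: (B, N, patch_dim) flattened patches from source views
        # src_poses:   (B, N, pose_dim)  per-patch camera-pose encodings
        # tgt_pose:    (B, pose_dim)     target camera-pose encoding
        memory_in = self.src_embed(torch.cat([src_patches, src_poses], dim=-1))
        batch = src_patches.shape[0]
        queries = self.tgt_queries.unsqueeze(0).expand(batch, -1, -1)
        queries = queries + self.tgt_pose_embed(tgt_pose).unsqueeze(1)
        decoded = self.transformer(memory_in, queries)
        return self.to_patch(decoded)  # (B, num_target_tokens, patch_dim)
```

One plausible intuition for the split (not a claim from the paper): the encoder processes the source views once and that work is amortized across target views, while a decoder-only model attends over source and target tokens jointly on every pass.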

The team's SVSM architecture establishes new design principles for training compute-optimal NVS models. Across multiple compute levels, SVSM scales as effectively as decoder-only models and achieves a superior performance-compute Pareto frontier. Most notably, it surpasses the previous state of the art on real-world NVS benchmarks while using substantially less training compute. The result has immediate implications for reducing the energy and financial costs of training advanced vision models, and it provides a clearer roadmap for efficiently scaling 3D reconstruction and generation capabilities, which are critical for applications in augmented reality, robotics, and content creation.
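
The summary does not detail how the paper fits its scaling laws, but the standard methodology behind claims like "a superior performance-compute Pareto frontier" can be sketched generically. The snippet below fits a saturating power law, loss(C) = a * C^(-b) + irreducible, a common functional form in scaling-law studies, to each architecture family; all numbers are synthetic placeholders, not results from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

# Synthetic placeholder points: (training compute, eval loss) for two
# hypothetical architecture families. Real studies sweep model size and
# data at each budget; these numbers only illustrate the fitting method.
compute = np.array([1.0, 10.0, 100.0, 1000.0])   # arbitrary compute units
losses = {
    "encoder-decoder": np.array([0.42, 0.31, 0.24, 0.20]),
    "decoder-only":    np.array([0.45, 0.33, 0.26, 0.22]),
}

def power_law(c, a, b, irreducible):
    # Saturating power law commonly used in scaling-law fits:
    # loss(C) = a * C**(-b) + irreducible
    return a * c ** (-b) + irreducible

for name, loss in losses.items():
    (a, b, irr), _ = curve_fit(power_law, compute, loss,
                               p0=(0.5, 0.3, 0.1), maxfev=10_000)
    print(f"{name}: loss(C) ~= {a:.2f} * C^-{b:.2f} + {irr:.2f}")

# At any budget C, the family whose fitted curve lies lower defines the
# performance-compute Pareto frontier at that point.
```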

Key Points
  • Overturns prior belief that decoder-only models are optimal for view synthesis, proving encoder-decoder architectures like SVSM can be compute-optimal.
  • Achieves a superior performance-compute Pareto frontier and surpasses the previous state of the art on real-world benchmarks with substantially reduced training compute.
  • Provides systematic scaling laws and design principles for efficiently training future Novel View Synthesis models.

Why It Matters

Enables more efficient and accessible high-quality 3D scene generation, reducing costs for AR/VR, robotics, and digital content creation.