Research & Papers

Leveraging Multimodal LLMs for Built Environment and Housing Attribute Assessment from Street-View Imagery

Fine-tuned Gemma 3 27B matches human ratings from street-view images alone...

Deep Dive

Researchers from the University of Notre Dame and collaborators have created a scalable AI framework that automatically evaluates building conditions across the United States using Google Street View imagery. By fine-tuning Google's Gemma 3 27B large language model on a modest human-labeled dataset, the system achieves strong alignment with human mean opinion scores (MOS)—outperforming even individual human raters on Spearman rank correlation coefficient (SRCC) and Pearson linear correlation coefficient (PLCC) metrics.
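The two alignment metrics mentioned above can be computed directly with SciPy; a minimal sketch, using illustrative placeholder scores rather than the paper's data:

```python
# Sketch of the evaluation metrics named in the paper: Spearman rank
# correlation (SRCC) and Pearson linear correlation (PLCC) between
# model-predicted scores and human mean opinion scores (MOS).
# The score lists below are hypothetical stand-ins, not paper data.
from scipy.stats import spearmanr, pearsonr

mos = [3.2, 1.8, 4.5, 2.9, 3.7]        # hypothetical human MOS per image
predicted = [3.0, 2.1, 4.4, 3.1, 3.5]  # hypothetical model ratings

srcc, _ = spearmanr(mos, predicted)    # rank-order agreement
plcc, _ = pearsonr(mos, predicted)     # linear agreement
print(f"SRCC={srcc:.3f}, PLCC={plcc:.3f}")
```

SRCC rewards getting the *ranking* of building conditions right, while PLCC rewards linear agreement with the raw scores, which is why the paper reports both.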

To address real-world deployment needs, the team applied knowledge distillation to transfer capabilities to smaller models. A distilled Gemma 3 4B achieves comparable performance with a 3x speedup, while further distillation into CNN (EfficientNetV2-M) and transformer (SwinV2-B) architectures delivers close-to-original performance at a 30x speed gain. Beyond condition scores, the framework assesses a wide range of built environment and housing attributes, validated through a human-AI alignment study; the results feed a visualization dashboard for homeowners and support downstream analysis.
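The distillation step can be sketched in PyTorch. This is a toy illustration of the general technique (a small student regressor trained to match a larger teacher's score predictions), not the paper's exact recipe; the model sizes and random features below are stand-ins for the Gemma 3 27B teacher and its smaller students:

```python
# Toy knowledge-distillation sketch: the student learns to reproduce the
# teacher's soft score predictions via an MSE objective. Models and data
# are illustrative, not the paper's architectures or training setup.
import torch
import torch.nn as nn

torch.manual_seed(0)
teacher = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 1))
student = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))

x = torch.randn(512, 64)              # stand-in for image features
with torch.no_grad():
    soft_targets = teacher(x)         # teacher's "condition scores"
    init_loss = nn.functional.mse_loss(student(x), soft_targets).item()

opt = torch.optim.Adam(student.parameters(), lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(student(x), soft_targets)
    loss.backward()
    opt.step()
final_loss = loss.item()
print(f"distillation loss: {init_loss:.4f} -> {final_loss:.4f}")
```

The speedups the paper reports come from the student's smaller size: once trained, only the compact model runs at inference time.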

Key Points
  • Fine-tuned Gemma 3 27B outperforms individual human raters on SRCC and PLCC against mean opinion scores
  • Knowledge distillation to Gemma 3 4B achieves comparable performance with 3x speedup
  • CNN (EfficientNetV2-M) and transformer (SwinV2-B) versions deliver 30x faster inference with close accuracy

Why It Matters

Enables large-scale, low-cost building condition assessment from street-view imagery, reducing human labeling effort.