Research & Papers

Exploring the Use of VLMs for Navigation Assistance for People with Blindness and Low Vision

GPT-4o leads AI models for guiding blind users, but open-source rivals like LLaVA struggle with spatial reasoning.

Deep Dive

A research team from NYU and collaborating institutions has published a comprehensive study evaluating how well current vision-language models (VLMs) can serve as navigation assistants for people with blindness and low vision (pBLV). The paper, posted to arXiv, tested state-of-the-art closed-source models, including OpenAI's GPT-4V and GPT-4o, Google's Gemini 1.5 Pro, and Anthropic's Claude 3.5 Sonnet, against open-source alternatives such as LLaVA-v1.6-mistral and LLaVA-OneVision-Qwen. The assessment focused on three visual skills foundational to navigation: accurately counting obstacles in a scene, reasoning about relative spatial relationships (e.g., "left of"), and demonstrating the common-sense scene understanding needed for wayfinding.

The findings reveal a stark performance hierarchy. OpenAI's GPT-4o consistently outperformed all other models across every task, demonstrating superior capability in complex spatial reasoning and providing contextually relevant scene descriptions that could guide a user. In contrast, the open-source models, while advancing rapidly, exhibited significant shortcomings. They struggled with accurately counting objects in cluttered environments, showed biases in spatial judgments, and often prioritized listing object details over delivering actionable spatial feedback—a critical failure for a navigation aid. The study used pBLV-specific prompts to simulate real assistance scenarios, further exposing these models' limitations in adaptability and nuanced reasoning.
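
To make the setup concrete, each query can be pictured as an image paired with a navigation-style question sent to a model's API. The Python sketch below is an illustration only, using the OpenAI SDK to ask GPT-4o a combined counting and spatial question; the image file, prompt wording, and answer format are assumptions, not the study's actual materials.

    # Illustrative sketch only: posing a navigation-style question about an image
    # to GPT-4o through the OpenAI Python SDK. The image path and prompt wording
    # are assumptions, not the prompts used in the paper.
    import base64
    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    # Encode a local street-scene photo so it can be sent inline as a data URL.
    with open("street_scene.jpg", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("I am blind and walking forward. How many obstacles are "
                          "directly in my path, and is the nearest one to my left "
                          "or my right? Answer in one short sentence.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )

    print(response.choices[0].message.content)

A benchmark along these lines would then score such answers against ground-truth obstacle counts and spatial relations; open-source models such as LLaVA are typically run through their own local inference code rather than a hosted API.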

Despite these gaps, the research confirms the strong promise of VLMs, particularly leading models such as GPT-4o, for future assistive technology. The paper offers actionable guidance for developers, emphasizing that effective integration into navigation aids will require models that are better aligned through human feedback and that have substantially more reliable spatial reasoning. The benchmark serves as a roadmap, highlighting both the immediate potential of top-tier commercial models and the specific technical hurdles that must be cleared before this technology is universally accessible and safe for pBLV users.

Key Points
  • GPT-4o outperformed Claude 3.5 Sonnet, Gemini 1.5 Pro, and open-source models like LLaVA across all navigation-relevant tasks.
  • Open-source VLMs struggled with critical functions: accurately counting obstacles in clutter and providing useful spatial reasoning feedback.
  • The study provides a benchmark for developers, showing AI can aid navigation but requires better spatial reasoning and human feedback alignment.

Why It Matters

This benchmark reveals which AI models are currently viable for building real-world navigation aids, a critical step toward accessible assistive technology.