Robotics

CodeGraphVLP: Code-as-Planner Meets Semantic-Graph State for Non-Markovian Vision-Language-Action Models

Robots remember past steps using a persistent semantic graph and an executable code planner, keeping long-horizon manipulation tasks on track under partial observability.

Deep Dive

A multi-university research team including Khoa Vo and Sieu Tran has introduced CodeGraphVLP, a hierarchical framework designed to overcome a fundamental limitation of current Vision-Language-Action (VLA) models. Standard VLA models assume the latest observation is sufficient for action reasoning, an assumption that breaks down in non-Markovian long-horizon tasks where evidence may be occluded or may have appeared earlier in the trajectory. CodeGraphVLP addresses this with three key components: a persistent semantic-graph state that maintains task-relevant entities and relations under partial observability, an executable code-based planner that performs efficient progress checks over this graph, and progress-guided visual-language prompting that constructs clutter-suppressed observations.
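
To make the first two components concrete, here is a minimal Python sketch of what a persistent semantic-graph state and an executable progress check over it could look like. All names and structures below (SemanticGraph, progress_check, the subject-relation-object triple format) are illustrative assumptions, not the paper's actual interfaces.

```python
# Minimal, hypothetical sketch of a persistent semantic-graph state and a
# code-based progress check over it. Names are assumptions for illustration.
from dataclasses import dataclass, field


@dataclass
class SemanticGraph:
    """Persistent task state: entities and relations that survive occlusion."""
    entities: dict[str, dict] = field(default_factory=dict)            # name -> attributes
    relations: set[tuple[str, str, str]] = field(default_factory=set)  # (subj, rel, obj)

    def update(self, detections: dict[str, dict],
               new_relations: set[tuple[str, str, str]]) -> None:
        # Merge new observations; entities seen earlier but now occluded are kept.
        self.entities.update(detections)
        self.relations |= new_relations

    def holds(self, subj: str, rel: str, obj: str) -> bool:
        return (subj, rel, obj) in self.relations


def progress_check(graph: SemanticGraph, subgoals: list[tuple[str, str, str]]):
    """Code-based planner step: return the first unsatisfied subgoal, if any."""
    for subgoal in subgoals:
        if not graph.holds(*subgoal):
            return subgoal   # next subtask to hand to the VLA executor
    return None              # all subgoals satisfied -> task complete


# Example: "put the apple in the bowl, then close the drawer"
graph = SemanticGraph()
graph.update({"apple": {"visible": True}, "bowl": {"visible": True}},
             {("apple", "in", "bowl")})
subgoals = [("apple", "in", "bowl"), ("drawer", "state", "closed")]
print(progress_check(graph, subgoals))   # -> ('drawer', 'state', 'closed')
```

Because the progress check is ordinary executable code over a small symbolic state rather than a fresh VLM query at every step, it is cheap to run, which is consistent with the latency gains reported below.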

In real-world non-Markovian manipulation tasks, CodeGraphVLP demonstrated significant improvements over strong VLA baselines and history-enabled variants, while substantially lowering planning latency compared to VLM-in-the-loop planning approaches. The framework's ability to output subtask instructions together with subtask-relevant objects allows it to focus the VLA executor on critical evidence, effectively handling clutter and distractors that typically degrade fine-grained visual grounding. Extensive ablation studies confirmed the contributions of each component, showing that the semantic-graph state and code-based planner are crucial for maintaining task coherence over long horizons. This work represents a practical step toward more reliable generalist robot manipulation in complex, real-world environments.
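
The planner-to-executor handoff described above can be sketched in a similarly hedged way: the planner emits a subtask instruction together with its relevant objects, and only those objects are kept in the observation passed to the VLA executor. The names and the filtering strategy below are assumptions for illustration, not CodeGraphVLP's actual implementation.

```python
# Hypothetical sketch of progress-guided prompting: keep only subtask-relevant
# objects so distractors are suppressed before the VLA executor acts.
from dataclasses import dataclass


@dataclass
class Subtask:
    instruction: str              # e.g. "close the drawer"
    relevant_objects: list[str]   # e.g. ["drawer"]


def build_executor_prompt(subtask: Subtask,
                          detections: dict[str, tuple[int, int, int, int]]):
    """Keep only boxes for subtask-relevant objects; drop everything else."""
    kept = {name: box for name, box in detections.items()
            if name in subtask.relevant_objects}
    prompt = (f"Instruction: {subtask.instruction}\n"
              f"Focus on: {', '.join(kept) if kept else 'scene overview'}")
    return prompt, kept


subtask = Subtask("close the drawer", ["drawer"])
detections = {"drawer": (10, 20, 200, 180), "mug": (300, 40, 360, 120)}
prompt, boxes = build_executor_prompt(subtask, detections)
print(prompt)   # the distractor "mug" is dropped from the executor's view
```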

Key Points
  • Combines a persistent semantic-graph state with a code-based planner to handle partial observability in long-horizon tasks
  • Improves task completion over strong VLA baselines while reducing planning latency compared to VLM-in-the-loop methods
  • Uses progress-guided visual-language prompting to construct clutter-suppressed observations for better focus

Why It Matters

Enables robots to reliably handle complex, long tasks by remembering past steps and ignoring visual clutter.