Research & Papers

Agentic LLM Planning via Step-Wise PDDL Simulation: An Empirical Characterisation

A new study pits Claude Haiku 4.5 against classical planners, revealing a costly 5.7x token overhead for only a modest gain in success rate.

Deep Dive

A new research paper from Kai Göbel, Pierrick Lorang, Patrik Zips, and Tobias Glück empirically tests whether Large Language Models (LLMs) can function as viable task planners for autonomous systems. The team developed PyPDDLEngine, an open-source Planning Domain Definition Language (PDDL) simulation engine that exposes planning operations as LLM tool calls via a Model Context Protocol (MCP) interface. This setup lets an LLM act as an interactive agent: rather than committing to a full plan upfront, it selects a single action at a time, observes the resulting simulated state, and can reset and retry.
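The sketch below illustrates what such a step-wise loop might look like in practice. It is a minimal, hypothetical rendering: the toy simulator class and the tool names (`get_state`, `apply_action`, `reset`, `goal_reached`) are assumptions for illustration, not the actual PyPDDLEngine or MCP interface described in the paper.

```python
# Minimal sketch of a step-wise agentic planning loop over simulator tool calls.
# NOTE: the simulator class and tool names below are illustrative assumptions,
# not the actual PyPDDLEngine/MCP API.

class ToyBlocksworldSim:
    """Toy two-block stand-in for a PDDL simulator exposed to the LLM as tools."""

    # Hand-written transition table: state -> {grounded action: next state}.
    TRANSITIONS = {
        frozenset({"on(b,a)", "ontable(a)", "clear(b)", "handempty"}): {
            "unstack(b,a)": frozenset({"holding(b)", "ontable(a)", "clear(a)"}),
        },
        frozenset({"holding(b)", "ontable(a)", "clear(a)"}): {
            "putdown(b)": frozenset(
                {"ontable(a)", "ontable(b)", "clear(a)", "clear(b)", "handempty"}
            ),
        },
        frozenset({"ontable(a)", "ontable(b)", "clear(a)", "clear(b)", "handempty"}): {
            "pickup(a)": frozenset({"holding(a)", "ontable(b)", "clear(b)"}),
        },
        frozenset({"holding(a)", "ontable(b)", "clear(b)"}): {
            "stack(a,b)": frozenset({"on(a,b)", "ontable(b)", "clear(a)", "handempty"}),
        },
    }
    GOAL = frozenset({"on(a,b)"})

    def __init__(self):
        self.initial = frozenset({"on(b,a)", "ontable(a)", "clear(b)", "handempty"})
        self.state = self.initial

    def get_state(self):             # tool: observe the current simulated state
        return self.state

    def apply_action(self, action):  # tool: apply one action; False if inapplicable
        nxt = self.TRANSITIONS.get(self.state, {}).get(action)
        if nxt is None:
            return False
        self.state = nxt
        return True

    def reset(self):                 # tool: restart from the initial state
        self.state = self.initial

    def goal_reached(self):          # tool: check goal satisfaction
        return self.GOAL <= self.state


def agentic_plan(sim, propose_action, max_steps=20):
    """Select one action per step, observe the result, reset and retry on failure."""
    plan = []
    for _ in range(max_steps):
        if sim.goal_reached():
            return plan
        action = propose_action(sim.get_state())  # the LLM's tool-calling turn
        if sim.apply_action(action):
            plan.append(action)
        else:
            sim.reset()                           # inapplicable action: start over
            plan.clear()
    return None
```

In the paper's setup, `propose_action` would be an LLM call routed through MCP; the reset-and-retry branch is what distinguishes this agentic loop from direct planning, where the model emits a complete plan in a single pass.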

In a head-to-head evaluation on 102 International Planning Competition (IPC) Blocksworld instances, the researchers compared four approaches under a uniform 180-second time budget. The classical planner Fast Downward (lama-first) set a high bar with 85.3% success. The direct LLM planning approach, using Claude Haiku 4.5, achieved 63.7% success. The novel agentic LLM planning approach via PyPDDLEngine reached 66.7%, showing a consistent but modest three-percentage-point advantage. However, this small gain came at a significant cost: the agentic method consumed 5.7 times more tokens per solution than direct planning.
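To make that trade-off concrete, the following back-of-the-envelope calculation uses only the figures reported above (63.7% vs. 66.7% success and the 5.7x tokens-per-solution ratio). It simplifies by assuming token spend is dominated by solved instances, and expresses costs in units of T, the tokens a single direct-planning solution consumes.

```python
# Marginal token cost of the agentic gain, using only figures reported in the study
# (63.7% vs. 66.7% success, 5.7x tokens per solution). Costs are in units of
# T = tokens per direct-planning solution; assumes spend is dominated by solved instances.
N = 100                                   # instances (the study used 102; 100 keeps numbers round)
direct_solved = 0.637 * N                 # ~63.7 solved instances
agentic_solved = 0.667 * N                # ~66.7 solved instances

direct_tokens = direct_solved * 1.0       # 1.0 T per solution
agentic_tokens = agentic_solved * 5.7     # 5.7 T per solution

extra_solves = agentic_solved - direct_solved   # ~3 additional solved instances
extra_tokens = agentic_tokens - direct_tokens   # ~316 T of additional token spend

print(f"~{extra_tokens / extra_solves:.0f} T per additional solved instance")
# -> roughly 105x the token budget of a single direct-planning solution
```

On this rough accounting, each of the roughly three extra solved instances per hundred problems costs on the order of a hundred direct-planning solutions' worth of tokens.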

The study's findings suggest that the benefits of an agentic approach depend heavily on the nature of environmental feedback. While coding agents thrive on externally grounded signals such as compiler errors, the PDDL step feedback in this simulation is self-assessed, leaving the model to evaluate its own progress without independent verification. Furthermore, the researchers note that both LLM approaches produced shorter plans than the classical seq-sat-lama-2011 planner across most co-solved problems, a result they attribute more to training-data recall than to demonstrable, generalizable planning intelligence.

Key Points
  • Agentic LLM planning via PyPDDLEngine achieved 66.7% success on IPC Blocksworld tasks, a mere three-percentage-point gain over direct LLM planning (63.7%).
  • The performance gain came at a high computational cost, with the agentic method using 5.7x more tokens per solution than direct planning.
  • Classical planner Fast Downward significantly outperformed both LLM methods, achieving 85.3% success, highlighting a substantial performance gap.

Why It Matters

This research quantifies the high cost and limited current payoff of using LLMs as interactive agents for complex planning, guiding practical AI system design.