Robotics

Value Functions for Temporal Logic: Optimal Policies and Safety Filters

Greedy Q-policies can put off tasks forever; now there's a fix.

Deep Dive

A fundamental challenge in reinforcement learning for autonomous systems is ensuring that policies not only maximize reward but also satisfy complex temporal logic (TL) specifications, such as "avoid obstacles until the goal is reached." A new paper from MIT researchers (So, Sharpless, Herbert, and Fan) reveals a subtle pathology in standard Bellman-based approaches: even when the value function is optimal, greedily maximizing the Q-function can produce policies that indefinitely postpone task completion for reach-avoid problems (equivalent to Until specifications in TL). This happens in the undiscounted infinite-horizon setting, where finishing the task later is worth exactly as much as finishing it now, so a greedy agent has no reason to prefer acting now.
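The deferral problem can be reproduced on a toy example. The sketch below is a minimal illustration, not the paper's own counterexample. It assumes a hypothetical four-state chain with deterministic dynamics, a reach margin g (positive only at the goal), an avoid margin h (positive everywhere, so there are no obstacles), and the standard undiscounted reach-avoid backup V(x) = min(h(x), max(g(x), max_a V(f(x, a)))). All function names are made up for the illustration.

```python
import numpy as np

# Hypothetical toy chain: states 0..3, goal is state 3, no obstacles.
N_STATES = 4
STAY, RIGHT = 0, 1
ACTIONS = (STAY, RIGHT)
g = np.array([-3.0, -2.0, -1.0, 1.0])  # reach margin: > 0 only at the goal
h = np.full(N_STATES, 5.0)             # avoid margin: > 0 means safe


def step(x, a):
    """Deterministic dynamics: stay put or move one cell to the right."""
    return x if a == STAY else min(x + 1, N_STATES - 1)


def value_iteration(n_iters=50):
    """Undiscounted reach-avoid backup, started from min(g, h)."""
    V = np.minimum(g, h)
    for _ in range(n_iters):
        V = np.array([
            min(h[x], max(g[x], max(V[step(x, a)] for a in ACTIONS)))
            for x in range(N_STATES)
        ])
    return V


def q_values(V, x):
    """Q(x, a) = min(h(x), max(g(x), V(next state)))."""
    return np.array([min(h[x], max(g[x], V[step(x, a)])) for a in ACTIONS])


def rollout_robustness(policy, x0=0, horizon=20):
    """Achieved score: max over t of min(g(x_t), min over s <= t of h(x_s))."""
    x, best, worst_avoid = x0, -np.inf, np.inf
    for _ in range(horizon):
        worst_avoid = min(worst_avoid, h[x])
        best = max(best, min(g[x], worst_avoid))
        x = step(x, policy(x))
    return best


V = value_iteration()


def greedy(x):
    # np.argmax returns the first maximizer, so Q-ties resolve to STAY.
    return int(np.argmax(q_values(V, x)))


def always_right(x):
    return RIGHT


print("V =", V)
print("Q at state 0 =", q_values(V, 0))
print("greedy robustness =", rollout_robustness(greedy))
print("always-right robustness =", rollout_robustness(always_right))
```

In this toy setup every state has value 1, so STAY and RIGHT tie in Q at every non-goal state. With ties resolved toward STAY, the greedy policy never leaves state 0 and achieves a robustness of -3, while always moving right achieves 1. A different tie-breaking rule would hide the problem here, which is part of why the pathology is easy to miss.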

To address this, the authors build on recent work that decomposes a TL value function into a graph of constituent value functions. They then construct non-Markovian policies (ones that depend on the state history, not just the current state) and prove their optimality with respect to the quantitative robustness score for nested Until, Globally, and Globally-Until specifications. Additionally, they show how the Q-function can act as a safety filter, extending prior results beyond simple avoid or reach-avoid tasks to complex TL specifications. Together, these results connect value-function theory to two practical needs in safety-critical robotics and planning: policies that actually finish the task, and runtime filters that block unsafe actions.
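For the simplest case mentioned above, a pure avoid task, a Q-function safety filter can be sketched as follows. This shows the general value-based filtering idea under assumed toy dynamics, not the paper's extension to nested TL specifications. The chain world, the avoid margin h, the zero threshold, and all function names are hypothetical.

```python
import numpy as np

# Hypothetical toy chain: states 0..4, state 0 is an obstacle (h < 0).
N = 5
LEFT, STAY, RIGHT = 0, 1, 2
MOVES = {LEFT: -1, STAY: 0, RIGHT: 1}
h = np.array([-1.0, 1.0, 2.0, 2.0, 2.0])  # avoid margin: > 0 means safe


def step(x, a):
    """Deterministic dynamics, clipped to the ends of the chain."""
    return int(np.clip(x + MOVES[a], 0, N - 1))


def avoid_value(n_iters=50):
    """Undiscounted avoid backup: V(x) = min(h(x), max_a V(f(x, a)))."""
    V = h.copy()
    for _ in range(n_iters):
        V = np.array([
            min(h[x], max(V[step(x, a)] for a in MOVES))
            for x in range(N)
        ])
    return V


def q_values(V, x):
    """Q(x, a) = min(h(x), V(next state)), indexed by action."""
    return np.array([min(h[x], V[step(x, a)]) for a in (LEFT, STAY, RIGHT)])


def safety_filter(V, x, a_nominal, threshold=0.0):
    """Keep the nominal action if its Q-value clears the threshold,
    otherwise fall back to the Q-maximizing action."""
    q = q_values(V, x)
    if q[a_nominal] > threshold:
        return a_nominal
    return int(np.argmax(q))


def filtered_rollout(V, x0, a_nominal, steps):
    """Apply a constant nominal action through the filter; return visited states."""
    traj = [x0]
    for _ in range(steps):
        traj.append(step(traj[-1], safety_filter(V, traj[-1], a_nominal)))
    return traj


V = avoid_value()
print(filtered_rollout(V, 3, LEFT, 5))
```

Here a nominal controller that always pushes left is allowed to act until its next move would enter the obstacle; at state 1 the filter overrides LEFT and the system holds position instead.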

Key Points
  • Identifies a pathological behavior where greedy Q-function policies defer task completion indefinitely for reach-avoid (Until) specifications in undiscounted settings.
  • Constructs non-Markovian policies based on state history that are provably optimal for nested Until, Globally, and Globally-Until TL specifications.
  • Extends Q-function safety filtering to complex temporal logic tasks, not just simple avoid or reach-avoid.

Why It Matters

Helps ensure that autonomous systems actually complete their tasks, rather than postponing them indefinitely, while staying safe under temporal logic specifications.