Value Functions for Temporal Logic: Optimal Policies and Safety Filters
Greedy Q-policies can put off tasks forever; now there's a fix.
A fundamental challenge in reinforcement learning for autonomous systems is ensuring that policies not only maximize reward but also satisfy complex temporal logic (TL) specifications — tasks like "always avoid obstacles, then eventually reach the goal." A new paper by So, Sharpless, Herbert, and Fan reveals a subtle pathology in standard Bellman-based approaches: even when the value function is optimal, a policy that greedily maximizes the Q-function can postpone completion of reach-avoid tasks (equivalent to Until specifications in TL) indefinitely. The failure arises in the undiscounted infinite-horizon setting, where the agent has no incentive to finish the task now rather than later.
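The stall pathology is easy to reproduce in a toy setting. Below is a minimal sketch (our illustration, not an example from the paper; all names are hypothetical): an undiscounted "eventually reach the goal" value function on a three-state chain, where the optimal Q-values tie between stepping toward the goal and staying put, so a greedy policy with an unlucky tie-break never finishes.

```python
import numpy as np

# Three-state chain 0 -> 1 -> 2, where state 2 is the goal.
N_STATES = 3
ACTIONS = {"stay": 0, "right": +1}
r = np.array([0.0, 0.5, 1.0])  # goal-proximity reward: robustness of "at goal"

def step(x, a):
    """Deterministic dynamics: move right or stay, clipped at the chain's end."""
    return min(x + ACTIONS[a], N_STATES - 1)

# Undiscounted reachability fixed point: V(x) = max(r(x), max_a V(step(x, a))).
V = r.copy()
for _ in range(N_STATES):  # converges within N_STATES sweeps on this chain
    V = np.array([max(r[x], max(V[step(x, a)] for a in ACTIONS))
                  for x in range(N_STATES)])

Q = {(x, a): max(r[x], V[step(x, a)])
     for x in range(N_STATES) for a in ACTIONS}

# V is 1.0 at every state, so Q(x, "stay") == Q(x, "right") everywhere:
# "stay" is greedy w.r.t. the *optimal* Q-function, yet never reaches the goal.
x = 0
for t in range(4):
    a = max(ACTIONS, key=lambda u: (Q[(x, u)], u == "stay"))  # tie-break: stay
    print(f"t={t}: state {x}, greedy action '{a}', Q = {Q[(x, a)]:.1f}")
    x = step(x, a)
```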
To address this, the authors build on recent work that decomposes a TL value function into a graph of constituent value functions. They then construct non-Markovian policies — ones that depend on the state history, not just the current state — and prove their optimality with respect to the quantitative robustness score for nested Until, Globally, and Globally-Until specifications. Additionally, they show how the Q-function can act as a safety filter, extending prior results beyond simple avoid or reach-avoid tasks to complex TL specifications. This work bridges theory and practice for safety-critical robotics and AI planning.
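To make the non-Markovian construction concrete, here is a minimal sketch under our own assumptions (illustrative names, not the paper's API) for the sequenced task "reach A, then reach B": the policy carries a finite memory recording which node of the value-function graph is active, and acts greedily with respect to that subtask's constituent Q-function.

```python
class SequencedReachPolicy:
    """History-dependent policy for "reach A, then reach B", assembled from
    two constituent Q-functions (assumed given, e.g. from value iteration)."""

    def __init__(self, Q_reach_A, Q_reach_B, in_A, actions):
        self.Q = {"A": Q_reach_A, "B": Q_reach_B}  # callables: Q(x, a) -> float
        self.in_A = in_A                           # predicate: is subgoal A reached?
        self.actions = actions
        self.mode = "A"                            # finite memory of the history

    def act(self, x):
        # Update the memory: once A has ever been reached, pursue B from then on.
        if self.mode == "A" and self.in_A(x):
            self.mode = "B"
        # Act greedily w.r.t. the currently active constituent Q-function.
        return max(self.actions, key=lambda a: self.Q[self.mode](x, a))
```

The mode variable is what makes the policy depend on history: it is non-Markovian in the raw state, though Markovian in the augmented (state, mode) pair.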
- Identifies a pathological behavior where greedy Q-function policies defer task completion indefinitely for reach-avoid (Until) specifications in undiscounted settings.
- Constructs non-Markovian policies based on state history that are provably optimal for nested Until, Globally, and Globally-Until TL specifications.
- Extends Q-function safety filtering beyond simple avoid and reach-avoid tasks to complex temporal logic specifications (see the sketch after this list).
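As a rough illustration of the filtering idea (our sketch with hypothetical names; the paper's construction covers general TL specifications), a safety Q-function can gate any task policy's actions:

```python
def safety_filter(x, u_nominal, Q_safe, actions, threshold=0.0):
    """Pass the task policy's action through when it preserves safety;
    otherwise fall back to the action with the best safety Q-value.
    Q_safe(x, u) is assumed to score the achievable robustness of the
    safety specification after taking u at x (>= threshold means safe)."""
    if Q_safe(x, u_nominal) >= threshold:
        return u_nominal
    return max(actions, key=lambda u: Q_safe(x, u))
```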
Why It Matters
Gives autonomous systems provable guarantees that they will both complete their tasks and stay safe under temporal logic specifications, rather than stalling indefinitely.