Sample Complexity Bounds for Stochastic Shortest Path with a Generative Model
New paper shows that learning an ε-optimal SSP policy requires Ω(SAB⋆³/(c_minε²)) samples, and that SSP problems with zero-cost actions may be unlearnable altogether.
A team of researchers including Jean Tarbouriech, Matteo Pirotta, Michal Valko, and Alessandro Lazaric has published groundbreaking work on the sample complexity of learning Stochastic Shortest Path (SSP) problems, a fundamental reinforcement learning framework. Their paper, accepted at ALT 2021, establishes that learning an ε-optimal policy in SSPs requires at least Ω(SAB⋆³/(c_minε²)) samples from a generative model, where S is the number of states, A the number of actions, c_min the minimum per-step cost, and B⋆ the maximum expected cost of the optimal policy over starting states. This provides the first tight characterization of SSP learning difficulty.
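To get a feel for the scale the lower bound implies, here is a small illustrative calculation. All parameter values below are hypothetical, chosen only to show how the terms in Ω(SAB⋆³/(c_minε²)) interact; they do not come from the paper.

```python
# Illustrative (hypothetical) parameters plugged into the lower bound
# Omega(S * A * B_star**3 / (c_min * eps**2)).
S = 100        # number of states
A = 10         # actions per state
B_star = 5.0   # max expected cost of the optimal policy from any state
c_min = 0.1    # minimum per-step cost (bound blows up as this -> 0)
eps = 0.05     # target suboptimality of the learned policy

samples_lower_bound = S * A * B_star**3 / (c_min * eps**2)
print(f"{samples_lower_bound:.2e}")  # order of the required sample count
```

Note how the bound diverges as c_min shrinks toward zero, which foreshadows the unlearnability result discussed next.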
The research reveals a surprising theoretical limitation: when the minimum cost c_min equals zero, SSP problems may become fundamentally unlearnable. This distinguishes SSP learning from finite-horizon and discounted settings, where zero-cost transitions create no such barrier. The team complemented their lower bound with matching algorithms: one achieves the bound up to logarithmic factors in the general case, and a second, specialized algorithm works even when c_min=0, provided the optimal policy has bounded expected hitting time to the goal state.
This work establishes SSP as a distinct complexity class in reinforcement learning theory, with implications for how researchers approach planning and learning in stochastic environments. The findings suggest that practical SSP implementations must carefully consider cost structures and may need to impose minimum costs to ensure learnability. The paper provides both fundamental limits and constructive algorithms, offering a complete theoretical picture of what makes SSP problems tractable or intractable.
Key Findings
- Proved a lower bound of Ω(SAB⋆³/(c_minε²)) samples needed for ε-optimal SSP policies with generative model access
- Revealed SSP problems with zero minimum cost (c_min=0) may be fundamentally unlearnable, unlike finite-horizon/discounted RL
- Provided matching algorithms achieving the bound up to logarithmic factors, including specialized algorithm for c_min=0 cases with bounded hitting time
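To make the central quantity B⋆ concrete, here is a minimal value-iteration sketch on a toy two-state SSP. This is a standard SSP planning routine run on a made-up MDP, not the paper's learning algorithm: the optimal cost-to-go V is computed with full knowledge of the model, and B⋆ is its maximum over non-goal states.

```python
# Toy SSP (illustrative, not from the paper): states 0 and 1, goal 'g'.
# Each action is (cost, {next_state: probability}); V(g) = 0 by definition.
mdp = {
    0: [(1.0, {1: 0.5, "g": 0.5}),   # cheap but may detour through state 1
        (2.0, {"g": 1.0})],          # costlier but reaches the goal surely
    1: [(1.0, {0: 0.9, "g": 0.1})],
}

V = {0: 0.0, 1: 0.0, "g": 0.0}
for _ in range(10_000):  # iterate the Bellman operator to a fixed point
    for s, actions in mdp.items():
        V[s] = min(c + sum(p * V[t] for t, p in P.items()) for c, P in actions)

B_star = max(V[s] for s in mdp)  # worst-case optimal expected cost-to-goal
print(round(B_star, 3))
```

In this toy model the fixed point is V(0)=2 and V(1)=2.8, so B⋆=2.8; the lower bound above scales with the cube of this quantity.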
Why It Matters
Establishes fundamental limits for reinforcement learning in stochastic environments, guiding development of provably efficient AI planning algorithms.