New paper reveals limits of mean-based algorithms in bandit learning
First lower bound proves how fast these algorithms can learn with bandit feedback.
A new arXiv paper by Julius Durmann and Amelie Kleber tackles a fundamental question in online learning: how fast can mean-based algorithms learn when the time horizon is unknown and only bandit feedback is available? Mean-based algorithms assign low probability to actions with low average rewards and are known to converge to serially undominated actions, which approximate Nash equilibria in games. However, empirical work had suggested slower convergence compared to classic bandit algorithms like EXP3 or UCB.
The authors provide the first rigorous lower bound on the algorithm-defining sequence γ_t, formally establishing a limit on learning speed. They also propose two new mean-based algorithms: one generalizes ε-greedy to unknown horizons, and the other extends mean-based Exp3 to work without a known time horizon. Experiments demonstrate that while mean-based algorithms are slightly slower, they perform competitively with other bandit-feedback algorithms. Crucially, the paper shows that depending on the choice of γ_t, algorithms can be both mean-based and no-regret, adding important context to earlier claims about their exploitability.
- First lower bound on learning speed for mean-based algorithms with unknown horizons.
- Two new algorithms: generalized ε-greedy and mean-based Exp3 for unknown horizons.
- Mean-based algorithms can be both mean-based and no-regret depending on γ_t selection.
Why It Matters
This work refines our understanding of algorithm safety and efficiency in online learning and game theory.