Online learning with Erdős–Rényi side-observation graphs
Two algorithms for multi-armed bandits with random side observations achieve near-optimal regret bounds.
In a paper published at ICML 2015, researchers Tomáš Kocák, Gergely Neu, and Michal Valko tackle the problem of online learning with side observations in adversarial multi-armed bandit settings. Specifically, they consider a scenario where, after pulling an arm, the learner observes not only the loss of the chosen arm but also the loss of each other arm with a fixed but unknown probability r, independently across arms and of the learner's actions. Equivalently, each round's side-observation graph is an Erdős–Rényi random graph with edge probability r. This setup models real-world situations where partial feedback is available, such as in recommendation systems or network routing.
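The feedback model can be sketched in a few lines. This is a minimal illustration of the observation process described above; the function name and interface are ours, not from the paper:

```python
import random

def pull_arm(chosen, losses, r, rng=random):
    """Simulate one round of the random side-observation feedback model.

    The learner always sees the loss of the chosen arm; every other
    arm's loss is revealed independently with probability r.
    Returns a dict {arm_index: observed_loss}.
    """
    observed = {chosen: losses[chosen]}
    for i, loss in enumerate(losses):
        if i != chosen and rng.random() < r:
            observed[i] = loss
    return observed
```

Setting r = 1 recovers the full-information setting and r = 0 recovers the standard bandit setting, which is why the achievable regret interpolates between the two as r varies.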
The authors propose two algorithms tailored to different ranges of r. The first algorithm works when r is relatively large (r ≥ (log T)/(2N)), achieving a regret bound of O(√((T/r) log N)). For smaller r, the second algorithm achieves a regret of O(√((T/r) log(N+T))). Both bounds are within logarithmic factors of the best possible performance any algorithm could achieve, even if it knew r in advance. Additionally, they provide a quick estimation procedure to determine which range r falls into, making the approach practical without prior knowledge of the observation probability.
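To make the algorithmic idea concrete, here is a generic exponential-weights (EXP3-style) sketch that exploits side observations via importance weighting, under the simplifying assumption that r is known. Arm i's loss is observed with probability o_i = p_i + (1 − p_i)·r, so dividing each observed loss by o_i yields an unbiased loss estimate. This is an illustrative baseline, not the paper's exact algorithms, which notably remove the need to know r:

```python
import math
import random

def exp3_side_obs(n_arms, horizon, r, eta, get_losses, rng=random):
    """EXP3-style sketch with importance-weighted loss estimates.

    Assumes the observation probability r is known (the paper's
    algorithms do not). get_losses(t) returns the adversary's loss
    vector for round t; returns the learner's cumulative loss.
    """
    weights = [1.0] * n_arms
    total_loss = 0.0
    for t in range(horizon):
        z = sum(weights)
        probs = [w / z for w in weights]
        chosen = rng.choices(range(n_arms), weights=probs)[0]
        losses = get_losses(t)
        total_loss += losses[chosen]
        for i in range(n_arms):
            # Arm i is observed if chosen, or revealed with probability r.
            observed = (i == chosen) or (rng.random() < r)
            if observed:
                o_i = probs[i] + (1.0 - probs[i]) * r  # observation probability
                weights[i] *= math.exp(-eta * losses[i] / o_i)
    return total_loss
```

The extra observations shrink the variance of the loss estimates by roughly a factor of r, which is the intuition behind the 1/√r improvement over the standard O(√(TN)) bandit regret.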
- First algorithm achieves O(√((T/r) log N)) regret when r ≥ (log T)/(2N)
- Second algorithm achieves O(√((T/r) log(N+T))) regret for smaller r values
- Includes a quick estimation procedure to determine the range of r without prior knowledge
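The estimation step can be illustrated as follows. This is a sketch of the underlying idea only (count what fraction of non-chosen arms are revealed over a few exploratory rounds); `get_feedback` is a hypothetical callback returning the set of arms revealed in a round, and the paper's actual procedure may differ in its details:

```python
import random

def estimate_r(n_arms, rounds, get_feedback, rng=random):
    """Estimate the observation probability r from a few rounds.

    Each non-chosen arm is revealed independently with probability r,
    so the empirical fraction of revealed side observations is an
    unbiased estimate of r. Illustrative sketch only.
    """
    side_obs = 0
    for t in range(rounds):
        chosen = rng.randrange(n_arms)
        observed = get_feedback(chosen)  # set of arms revealed this round
        side_obs += len(observed - {chosen})
    return side_obs / (rounds * (n_arms - 1))
```

Once the estimate places r above or below the (log T)/(2N) threshold, the corresponding algorithm can be run for the remaining rounds.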
Why It Matters
These algorithms improve regret bounds in bandit problems with random side observations, enabling more efficient learning in real-world applications.