Research & Papers

Online learning with Erdős–Rényi side-observation graphs

Two algorithms for multi-armed bandits with random side observations achieve near-optimal regret bounds.

Deep Dive

In a paper published at ICML 2015, researchers Tomáš Kocák, Gergely Neu, and Michal Valko tackle online learning with side observations in the adversarial multi-armed bandit setting. After pulling an arm, the learner observes not only the loss of the chosen arm but also the loss of every other arm, each independently with a fixed but unknown probability r, regardless of the learner's actions. Equivalently, the feedback in each round is described by a random Erdős–Rényi observation graph over the arms. This setup models situations where partial feedback arrives at random, such as in recommendation systems or network routing.
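To make the feedback model concrete, here is a minimal simulation of one round of side-observation feedback. The function name `play_round` and its signature are illustrative, not from the paper:

```python
import random

def play_round(losses, chosen, r, rng=random):
    """Simulate one round of the side-observation feedback model:
    the learner always sees the loss of the chosen arm, and sees
    each other arm's loss independently with probability r."""
    observed = {chosen: losses[chosen]}
    for arm, loss in enumerate(losses):
        if arm != chosen and rng.random() < r:
            observed[arm] = loss
    return observed
```

With r = 0 this reduces to the standard bandit setting (only the chosen arm is observed), and with r = 1 it becomes full information.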

The authors propose two algorithms tailored to different ranges of r. The first applies when r is relatively large (r ≥ (log T)/(2N)) and achieves a regret bound of O(√((T/r) log N)). The second handles smaller r and achieves a regret of O(√((T/r) log(N+T))). Both bounds match, up to logarithmic factors, the best performance achievable by any algorithm, even one that knows r in advance. The authors also give a quick estimation procedure that determines which range r falls into, making the approach practical without prior knowledge of the observation probability.
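As a rough illustration of the kind of learner behind such bounds, here is a simplified exponential-weights sketch in which observed losses are importance-weighted by the probability o_i = p_i + (1 − p_i)r of observing arm i. This assumes r is known and is not the paper's exact algorithm (which differs in details such as implicit exploration and learning-rate tuning); names like `exp3_side_obs` and `eta` are mine:

```python
import math
import random

def exp3_side_obs(T, N, r, loss_fn, eta, rng=random):
    """Illustrative Exp3-style learner for the side-observation model.
    Each observed loss is divided by the probability of observing that
    arm, giving an unbiased loss estimate for the weight update."""
    weights = [1.0] * N
    total_loss = 0.0
    for t in range(T):
        s = sum(weights)
        probs = [w / s for w in weights]
        # sample the arm to pull from the exponential-weights distribution
        u, chosen, acc = rng.random(), N - 1, 0.0
        for i, p in enumerate(probs):
            acc += p
            if u < acc:
                chosen = i
                break
        losses = loss_fn(t)          # adversary's losses this round, in [0, 1]
        total_loss += losses[chosen]
        for i in range(N):
            # the chosen arm is always observed; others with probability r
            if i == chosen or rng.random() < r:
                o_i = probs[i] + (1.0 - probs[i]) * r  # P(observe arm i)
                est = losses[i] / o_i                  # unbiased loss estimate
                weights[i] *= math.exp(-eta * est)
    return total_loss
```

On a toy instance where one arm always has zero loss, the sketch quickly concentrates on that arm, which is the behavior the regret bounds quantify.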

Key Points
  • First algorithm achieves O(√((T/r) log N)) regret when r ≥ (log T)/(2N)
  • Second algorithm achieves O(√((T/r) log(N+T))) regret for smaller r values
  • Includes a quick estimation procedure to determine the range of r without prior knowledge
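The range test in the last point can be grounded in a simple frequency estimate: over a handful of rounds, count what fraction of the N − 1 non-chosen arms had their losses revealed. A minimal sketch with hypothetical names (the paper's procedure additionally comes with high-probability guarantees for choosing between the two regimes):

```python
def estimate_r(side_observations, rounds, N):
    """Naive frequency estimate of the observation probability r.
    side_observations is the total number of non-chosen arms whose
    losses were revealed over `rounds` rounds; each round reveals each
    of the N - 1 other arms independently with probability r."""
    return side_observations / (rounds * (N - 1))
```

With such an estimate in hand, one would run the first algorithm when the estimated r falls in the large-r range (r ≥ (log T)/(2N)) and the second otherwise.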

Why It Matters

These algorithms improve regret bounds in bandit problems with random side observations, enabling more efficient learning in real-world applications.