One bit per batch is enough: new bandit algorithms achieve near-optimal regret
Linear bandits with just one bit of feedback per batch nearly match unconstrained performance.
A team of researchers studied stochastic linear bandits under combined batching and communication constraints: the time horizon T is split into batches of equal size B, and in each batch the learner sends B arm pulls to an agent, which then responds with a single bit of feedback. They proved a minimax lower bound and designed two phased-elimination algorithms whose regret is within logarithmic factors of the unconstrained setting, even for batch sizes as large as Θ(√T). A single bit of feedback per batch therefore suffices to nearly match the minimax regret of unconstrained linear bandits across broad scaling regimes.
- Setting: batched linear bandits where the agent returns only 1 bit per batch; the learner designs the quantization rule for each batch.
- Minimax lower bound: Ω(B min{d, log|A|}) regret from communication alone, plus standard statistical terms.
- Two algorithms achieve Õ(dB + d√T) and Õ(B log|A| + d^{3/2}√B + √(dT log|A|)) regret, nearly matching the lower bounds.
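To get a feel for how the bounds above scale, here is a minimal sketch that evaluates their leading terms for concrete numbers. It assumes d is the feature dimension and |A| the number of arms, drops all constants and the logarithmic factors hidden by Õ, and uses a hypothetical helper name (`regret_terms`) and illustrative parameter values that do not come from the paper. It only illustrates the arithmetic and does not implement either algorithm.

```python
import math


def regret_terms(d, B, T, num_arms):
    """Leading terms of the stated bounds, with constants and log factors dropped.

    d        -- feature dimension (assumed meaning of d)
    B        -- batch size
    T        -- time horizon
    num_arms -- number of arms, |A| (assumed meaning of |A|)
    """
    log_a = math.log(num_arms)
    return {
        # Communication part of the lower bound: B * min{d, log|A|}
        "comm_lower": B * min(d, log_a),
        # First algorithm: dB + d*sqrt(T)
        "upper_1": d * B + d * math.sqrt(T),
        # Second algorithm: B log|A| + d^{3/2} sqrt(B) + sqrt(d T log|A|)
        "upper_2": B * log_a + d ** 1.5 * math.sqrt(B) + math.sqrt(d * T * log_a),
    }


if __name__ == "__main__":
    T, d, num_arms = 1_000_000, 10, 100  # illustrative values only
    for B in (1, 100, math.isqrt(T)):
        terms = regret_terms(d, B, T, num_arms)
        print(f"B={B}: " + ", ".join(f"{k}={v:.0f}" for k, v in terms.items()))
```

With B = √T the dB term equals d√T, so the first bound stays on the order of d√T, which is why batches as large as Θ(√T) cost only a constant factor in this expression.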
Why It Matters
The result enables near-optimal learning under extreme communication constraints, which is key for distributed sensing and low-power AI agents.