Research & Papers

Design Experiments to Compare Multi-armed Bandit Algorithms

A new method called Artificial Replay reduces the number of user interactions needed to compare two bandit algorithms over a horizon of T rounds from 2T to T + o(T).

Deep Dive

Researchers from Columbia University and the University of Toronto have introduced a new, cost-effective method for comparing multi-armed bandit algorithms, a critical task for tech platforms that rely on them for dynamic recommendations and ad-serving. The paper, "Design Experiments to Compare Multi-armed Bandit Algorithms," tackles a major bottleneck: traditional testing requires many independent runs of each algorithm, consuming massive user traffic and delaying deployment. Their solution, Artificial Replay (AR), cleverly reuses data from a single run of a control policy to efficiently test a new one.

Artificial Replay works by first running one policy (e.g., UCB) and recording its sequence of actions and user rewards. When testing a second policy (e.g., Thompson Sampling), AR consumes an unused recorded reward whenever the new policy selects an action the first run already sampled, querying the real environment only when no logged sample for that action remains. The team's analytical framework proves this yields an unbiased estimate of the new policy's performance. Crucially, it requires only T + o(T) total user interactions, nearly halving the cost of the standard 2T approach, and its variance grows sub-linearly with time T, ensuring more reliable comparisons.
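The paper itself does not ship reference code, but the mechanism is simple enough to sketch. Below is a minimal, illustrative Python implementation assuming a stochastic Bernoulli bandit; the names (BernoulliEnv, UCB1, run_and_log, artificial_replay) and all parameters are our own choices for illustration, not the authors'.

```python
import math
import random
from collections import defaultdict

class BernoulliEnv:
    """Hypothetical live environment: Bernoulli arms; counts real user queries."""
    def __init__(self, means, seed=0):
        self.means = means
        self.rng = random.Random(seed)
        self.queries = 0  # fresh user interactions consumed so far

    def pull(self, arm):
        self.queries += 1
        return 1.0 if self.rng.random() < self.means[arm] else 0.0

class UCB1:
    """Textbook UCB1; stands in for any policy exposing select() and update()."""
    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.sums = [0.0] * n_arms
        self.t = 0

    def select(self):
        self.t += 1
        for arm, count in enumerate(self.counts):
            if count == 0:  # play every arm once before using the index
                return arm
        return max(range(len(self.counts)),
                   key=lambda a: self.sums[a] / self.counts[a]
                              + math.sqrt(2 * math.log(self.t) / self.counts[a]))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward

def run_and_log(policy, env, horizon):
    """Live run of the control policy, logging every (arm, reward) sample."""
    log = defaultdict(list)
    for _ in range(horizon):
        arm = policy.select()
        reward = env.pull(arm)
        policy.update(arm, reward)
        log[arm].append(reward)
    return log

def artificial_replay(policy, env, log, horizon):
    """Run the test policy: consume one unused logged sample per matching pull,
    and query the live environment only when that arm's log is exhausted."""
    for _ in range(horizon):
        arm = policy.select()
        reward = log[arm].pop() if log[arm] else env.pull(arm)
        policy.update(arm, reward)
```

Because rewards are i.i.d. per arm in this setting, consuming logged samples in any order leaves the test run statistically indistinguishable from a live one, which is the intuition behind the unbiasedness result.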

The researchers validated AR's theoretical gains through numerical experiments with standard algorithms like UCB, Thompson Sampling, and ε-greedy. This innovation directly addresses the high cost and slow pace of live experimentation, allowing data science teams to iterate on and validate new bandit policies much faster. For any company running online experiments with adaptive algorithms, AR presents a practical framework to accelerate innovation while conserving valuable user interactions.
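To make the cost claim concrete, here is a usage sketch continuing the code above; the arm means, seed, and horizon are invented for illustration.

```python
env = BernoulliEnv(means=[0.3, 0.5, 0.7], seed=42)  # made-up arm means
T = 10_000

control = UCB1(n_arms=3)
log = run_and_log(control, env, T)   # the control run costs exactly T pulls

test = UCB1(n_arms=3)                # swap in any policy to be tested
artificial_replay(test, env, log, T)

print(f"total live interactions: {env.queries}")  # close to T, well under 2T
```

The second run only touches real users on the rounds where it explores arms the control run under-sampled, which is where the T + o(T) accounting comes from.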

Key Points
  • Artificial Replay (AR) is an unbiased experimental design that reuses reward data from a control policy to test a new one.
  • The method reduces required user interactions from 2T to T + o(T), cutting experimental cost nearly in half for policies with sub-linear regret.
  • AR's estimator variance grows sub-linearly in T, providing more reliable inference than naïve designs whose variance grows linearly.

Why It Matters

Enables faster, cheaper deployment of optimized recommendation and ad-serving algorithms for major online platforms.