Concave Statistical Utility Maximization Bandits via Influence-Function Gradients
A novel approach to multi-armed bandits that optimizes concave statistical utilities of the reward distribution, such as variance or Wasserstein distance, rather than only expected cumulative reward.
The paper 'Concave Statistical Utility Maximization Bandits via Influence-Function Gradients' by Matías Carrasco and Alejandro Cholaquidis, submitted to arXiv on April 24, 2026, addresses a gap in multi-armed bandit research. Traditional bandit algorithms focus on maximizing expected cumulative reward, but many real-world applications require optimizing other statistical properties of the reward distribution, such as variance (for risk management) or Wasserstein distance (for distributional robustness). The authors show that under mild continuity assumptions, the infinite-horizon problem reduces to optimizing over stationary mixed policies, where each weight vector on the simplex induces a mixture law, and performance is measured by a concave utility function.
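Concretely, a stationary mixed policy is just a weight vector w on the probability simplex, and each w induces a mixture of the arm distributions; a statistical utility such as variance is then a function of w (and variance is in fact concave in w, being linear-minus-convex). A minimal sketch of this setup, with two hypothetical Gaussian arms and a Monte Carlo estimate of the utility (not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-arm bandit: each arm k has its own reward law P_k.
# A stationary mixed policy is a weight vector w on the simplex; playing
# arm k with probability w_k induces the mixture law P_w = sum_k w_k P_k.
arm_samplers = [
    lambda n: rng.normal(0.0, 1.0, n),   # arm 0: N(0, 1)
    lambda n: rng.normal(1.0, 2.0, n),   # arm 1: N(1, 4)
]

def mixture_samples(w, n=100_000):
    """Draw n rewards from the mixture law induced by weights w."""
    arms = rng.choice(len(w), size=n, p=w)
    x = np.empty(n)
    for k, sample in enumerate(arm_samplers):
        mask = arms == k
        x[mask] = sample(mask.sum())
    return x

def utility(w):
    """Example statistical utility: variance of the mixture law.

    Monte Carlo estimate; the paper treats general concave utilities
    (variance, Wasserstein distance, ...) of the mixture."""
    return mixture_samples(np.asarray(w)).var()

for w in ([1.0, 0.0], [0.5, 0.5], [0.0, 1.0]):
    print(w, round(utility(w), 2))
```

The point of the reduction is that the infinite-horizon problem becomes a concave maximization of `utility(w)` over the simplex, so first-order methods apply once a gradient can be estimated from bandit feedback.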
To solve this, they leverage influence-function calculus to derive stochastic gradient estimators from bandit feedback, enabling an entropic mirror-ascent algorithm on a truncated simplex. This algorithm uses multiplicative-weights updates and plug-in estimates of the influence function, achieving regret bounds that separate the mirror-ascent optimization error from the bias caused by estimating the influence function. Numerical experiments on variance and Wasserstein objectives demonstrate the effectiveness of the approach, comparing exact and plug-in influence-function implementations. This work has implications for fields like finance, where risk-adjusted returns matter, and robust statistics, where distributional properties beyond the mean are critical.
- Optimizes concave statistical utilities (variance, Wasserstein distance) instead of expected reward in multi-armed bandits
- Uses influence-function calculus to derive stochastic gradient estimators from bandit feedback
- Achieves regret bounds that separate optimization error from estimation bias via entropic mirror-ascent algorithm
Why It Matters
Enables bandit optimization for risk and distributional robustness, expanding beyond expected reward to real-world statistical objectives.