DeepMind's AlphaZero Value Predictions: Self-Play Training vs. Real Opponents

r/MachineLearning May 11, 2026

⚡ AlphaZero's value function is shaped by its own style. Can it beat a completely different opponent?

Deep Dive

DeepMind's AlphaZero learns to predict the value of a game state by training on data generated through self-play between the current model and its predecessors. The value is meant to reflect the probability of winning against a copy of the agent itself, but in practice it averages outcomes against all predecessor models within a rolling window, weighted toward recent play. Moves are selected by a PUCT rule that uses these predicted values, with a stochastic element: Dirichlet noise is added to encourage exploration. As a result, the training data includes 'outlier' moves the agent would not otherwise choose, so describing the value as the win probability against the exact same agent is an oversimplification.
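
To make the selection rule concrete, here is a minimal Python sketch of PUCT selection with Dirichlet noise mixed into the root priors. It is an illustration under stated assumptions, not DeepMind's implementation: the function names, the exploration constant `c_puct=1.5`, and the noise parameters `alpha=0.3` and `epsilon=0.25` are hypothetical choices made for this example.

```python
import math
import random


def dirichlet(alpha, n, rng=random):
    """Sample a symmetric Dirichlet(alpha) vector of length n
    by normalizing independent Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n)]
    total = sum(draws)
    if total == 0.0:  # numerical edge case: fall back to uniform noise
        return [1.0 / n] * n
    return [d / total for d in draws]


def add_root_noise(priors, alpha=0.3, epsilon=0.25, rng=random):
    """Blend the policy priors with Dirichlet noise (assumed mixing rule):
    P'(a) = (1 - epsilon) * P(a) + epsilon * noise(a)."""
    noise = dirichlet(alpha, len(priors), rng)
    return [(1 - epsilon) * p + epsilon * n for p, n in zip(priors, noise)]


def puct_select(priors, q_values, visit_counts, c_puct=1.5):
    """Return the index of the move maximizing Q(a) + U(a), where
    U(a) = c_puct * P(a) * sqrt(total visits) / (1 + N(a))."""
    total_visits = sum(visit_counts)

    def score(a):
        exploration = (
            c_puct * priors[a] * math.sqrt(total_visits) / (1 + visit_counts[a])
        )
        return q_values[a] + exploration

    return max(range(len(priors)), key=score)
```

With all Q values at zero and ten visits on the first move, `puct_select([0.7, 0.2, 0.1], [0.0, 0.0, 0.0], [10, 0, 0])` returns index 1, because the exploration term favors the unvisited move with the larger prior. Passing the priors through `add_root_noise` first can occasionally push a low-prior move to the top, which is the kind of 'outlier' move the summary refers to.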

The practical consequence is that AlphaZero's value predictions are governed primarily by its own playing style and its historical development, not by direct experience against a wide variety of opponents. Outlier moves occur so infrequently that they have minimal impact on the value function. While AlphaZero has empirically outperformed humans and other algorithms in many games, the post's author questions whether this success is theoretically guaranteed. Could the model fail against a specific algorithm whose moves, though technically present in the training data, are so rare that they barely shape the value predictions? This highlights a fundamental tension: self-play training produces powerful agents, but their internal value estimates may not generalize to opponents with fundamentally different strategies.
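
As a rough illustration of how weighting toward recent play shrinks the influence of older data, the Python sketch below assumes geometric weights over a rolling window of model generations, as the summary describes. The decay factor of 0.5, the window length, and the function names are hypothetical, since the source gives no concrete values.

```python
def geometric_weights(window_size, decay=0.5):
    """Normalized geometric weights over a rolling window.
    Index 0 is the most recent model generation; each older
    generation is down-weighted by another factor of `decay`."""
    raw = [decay ** age for age in range(window_size)]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_value_target(outcomes_by_age, decay=0.5):
    """Weighted average of mean game outcomes in [-1, 1], where
    outcomes_by_age[i] comes from data generated i generations ago."""
    weights = geometric_weights(len(outcomes_by_age), decay)
    return sum(w * o for w, o in zip(weights, outcomes_by_age))
```

Under these assumed settings, the oldest generation in a window of ten carries roughly 0.1% of the total weight, so whatever it contributed barely moves the value target. Data that is both old and rare would matter even less.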

Key Points

Value is trained on a rolling window of self-play data from the latest models, with geometric weighting favoring recent experience.
Dirichlet noise perturbs PUCT-based move selection, so training data includes stochastic outlier moves that have little impact on value predictions.
The value reflects average strength against predecessor models, not against any arbitrary strong opponent, raising generalization concerns.

Why It Matters

Challenges the assumption that self-play-based value functions are robust against opponents with very different playing styles.
