Research & Papers

Curiosity-Critic: Cumulative Prediction Error Improvement as a Tractable Intrinsic Reward for World Model Training

New algorithm improves world model training by distinguishing between learnable patterns and inherent randomness.

Deep Dive

Researchers Vin Bhaskara and Haicheng Wang have introduced Curiosity-Critic, a novel approach to intrinsic reward design for training AI world models. Unlike traditional prediction-error-based curiosity methods that focus on immediate transitions, Curiosity-Critic grounds its reward in the improvement of cumulative prediction error across all visited states. The method reduces to a tractable per-step form: the difference between current prediction error and the asymptotic error baseline for each state transition. This baseline is estimated online by a learned critic that co-trains alongside the world model, converging well before the world model saturates.
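
The paper's implementation isn't shown here, but the per-step reward described above can be sketched in a tabular toy. All names, update rules, and learning rates below are our assumptions, not the authors' code; the critic simply gets a larger step size so it converges before the world model does.

```python
import random

random.seed(0)

# Toy sketch of the per-step Curiosity-Critic reward (our own illustration).
# World model: a tabular predictor of a binary outcome per state.
# Critic: a per-state estimate of the asymptotic prediction error, trained
# online with a larger step size so it converges before the world model.

model = [0.5, 0.5]      # predicted probability that outcome = 1, per state
baseline = [1.0, 1.0]   # critic's asymptotic-error estimate, per state
LR_MODEL, LR_CRITIC = 0.05, 0.2

def step(state):
    # State 0 is deterministic (outcome always 1); state 1 is a coin flip.
    outcome = 1 if state == 0 else random.randint(0, 1)
    error = (outcome - model[state]) ** 2       # current prediction error
    reward = error - baseline[state]            # per-step intrinsic reward
    model[state] += LR_MODEL * (outcome - model[state])        # world-model update
    baseline[state] += LR_CRITIC * (error - baseline[state])   # critic update
    return reward

for _ in range(2000):
    step(0)
    step(1)

# Learnable state 0: the model converges, so error and baseline both shrink.
# Stochastic state 1: error plateaus at the noise floor, but so does the
# baseline, so the intrinsic reward collapses toward zero.
```

The key design point mirrored here is that the baseline is learned online, per transition, rather than supplied as an oracle noise level.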

Curiosity-Critic's key innovation is its ability to separate epistemic (reducible) from aleatoric (irreducible) prediction error in real time. The reward remains high for learnable transitions, where the world model can still improve, but collapses toward zero for inherently stochastic ones, whose error cannot fall below the critic's baseline. This redirects exploration toward transitions that offer genuine learning opportunities rather than chasing unpredictable noise. The researchers show that prior curiosity formulations, from Schmidhuber's 1991 work to modern learned-feature-space variants, emerge as special cases corresponding to specific approximations of this baseline.
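
The contrast with baseline-free prediction-error curiosity can be made concrete with a toy of our own (not the paper's code): on a purely stochastic transition, the raw error rewards noise indefinitely, while subtracting a learned asymptotic-error baseline drives the reward toward zero.

```python
import random

random.seed(1)

# Hypothetical comparison on a single irreducibly noisy transition.
pred = 0.5       # world model's predicted probability that outcome = 1
baseline = 0.0   # critic's running estimate of the asymptotic error
raw, baselined = [], []

for t in range(5000):
    outcome = random.randint(0, 1)       # coin flip: aleatoric noise
    err = (outcome - pred) ** 2
    raw.append(err)                      # classic curiosity: the raw error
    baselined.append(err - baseline)     # Curiosity-Critic-style reward
    pred += 0.05 * (outcome - pred)      # model stays stuck near 0.5
    baseline += 0.2 * (err - baseline)   # critic tracks the noise floor

late_raw = sum(raw[-1000:]) / 1000
late_cc = sum(baselined[-1000:]) / 1000
# Raw curiosity plateaus near the 0.25 noise floor; the baselined reward
# averages near zero, so exploration is not drawn to the noise.
```

Setting the baseline identically to zero recovers plain prediction-error curiosity, illustrating the special-case relationship described above.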

In experiments on a stochastic grid-world environment, Curiosity-Critic demonstrated significant advantages over both prediction-error and visitation-count baselines, converging approximately 2x faster and reaching higher final world model accuracy. The approach requires no oracle knowledge of the environment's noise floor, making it practical for real-world applications where stochasticity levels are unknown. The paper establishes a unified theoretical framework that connects decades of curiosity research while providing a more effective practical implementation.

Key Points
  • Uses cumulative prediction error improvement rather than single-step errors for intrinsic rewards
  • Co-trained critic estimates asymptotic error baselines, converging roughly 2x faster than the world model itself
  • Separates epistemic from aleatoric error online without requiring oracle knowledge of noise levels

Why It Matters

Enables more efficient AI training by focusing exploration on genuinely learnable patterns rather than random noise.