ICML 2026 paper argues deployed RL must be continual, not train-then-fix
Reinforcement learning systems that stop learning after deployment are fundamentally suboptimal, researchers say.
Most real-world reinforcement learning (RL) systems follow a train-then-fix cycle: agents are trained, deployed, and only retrained when performance degrades. In a position paper accepted to the ICML 2026 Position Paper Track, researchers Parnian Behdin, Kevin Roice, and Golnaz Mesbahi argue this approach is fundamentally flawed. They claim that any deployed RL agent receiving evaluative reward signals faces inherent non-stationarity that demands continual learning—not periodic retraining. The paper pinpoints four sources of post-deployment non-stationarity: shifts in the environment, changes in user behavior, evolving system dynamics, and new task objectives. Each makes static agents suboptimal over time.
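The cost of freezing a policy can be illustrated with a toy example. The sketch below is not taken from the paper: it assumes a hypothetical two-armed bandit whose better arm flips halfway through deployment, and it compares a frozen policy against an epsilon-greedy agent that keeps updating its value estimates. The function names, step counts, and the `epsilon` and `alpha` values are all illustrative choices.

```python
import random


def reward(arm, step, shift_at):
    """Hypothetical environment: the better arm flips at step `shift_at`."""
    best = 0 if step < shift_at else 1
    return 1.0 if arm == best else 0.0


def run_frozen(steps, shift_at):
    # Policy fixed at deployment: always pull the arm that was best in training.
    return sum(reward(0, t, shift_at) for t in range(steps))


def run_continual(steps, shift_at, epsilon=0.1, alpha=0.1, seed=0):
    rng = random.Random(seed)
    q = [1.0, 0.0]  # value estimates carried over from training (assumed)
    total = 0.0
    for t in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(2)  # explore: pick an arm uniformly at random
        else:
            arm = 0 if q[0] >= q[1] else 1  # exploit the current estimates
        r = reward(arm, t, shift_at)
        # A constant step size keeps tracking the reward after it changes.
        q[arm] += alpha * (r - q[arm])
        total += r
    return total


frozen_total = run_frozen(1000, 500)
continual_total = run_continual(1000, 500)
print(frozen_total, continual_total)
```

In this setup the frozen policy collects reward only until the shift, while the continual agent pays a small exploration cost throughout but recovers shortly after the change. Real deployments are far messier, so this is best read as an intuition aid rather than evidence for the paper's claims.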
The authors point to existing real-world deployments of continual RL, such as adaptive recommendation systems and robots that refine their policies during operation, as evidence that the approach is practical. They urge the community to abandon the train-then-fix paradigm in favor of architectures that support never-ending adaptation. This shift promises more robust, efficient AI systems, especially in high-stakes domains like autonomous driving, healthcare, and industrial automation, where a frozen policy can quickly become outdated.
- Identifies four distinct sources of non-stationarity that static RL agents face post-deployment
- Accepted to the ICML 2026 Position Paper Track
- Argues the train-then-fix paradigm should be replaced by continual learning for optimal real-world performance
Why It Matters
Real-world RL systems (autonomous driving, robotics, recommendations) can degrade when conditions change after deployment; this paper makes the case for replacing periodic retraining with continual adaptation.