Research & Papers

Counterfactual RL method prevents harm to individuals while optimizing rewards

A two-stage procedure ensures policies don't sacrifice individuals for average performance.

Deep Dive

Reinforcement learning algorithms typically optimize the average return across a population, but this can produce policies that harm specific individuals. A new paper from Li, Wu, and Shi (arXiv:2605.25114) tackles this by formalizing individual harm through a counterfactual lens: an action is harmful if it leads to a strictly worse outcome than an available alternative. The authors then propose a two-stage procedure that first estimates counterfactual outcomes and then learns a policy that balances expected return against a constraint on the harm rate.
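To make the idea concrete, here is a minimal Python sketch of harm-constrained policy selection. It is not the authors' algorithm: the threshold policy class, the choice of action 0 as the comparison baseline, the helper names, and the toy numbers are all assumptions made for illustration, and the stage-one counterfactual estimates are simply taken as given.

```python
# Illustrative sketch only: the threshold policy class, the baseline
# convention (action 0), and the toy numbers are assumptions.

# Stage 1 (assumed already done): estimated outcomes per individual, stored as
# (feature x, (outcome under action 0, outcome under action 1)).
ESTIMATES = [
    (0.1, (1.0, 0.5)),
    (0.3, (1.0, 0.8)),
    (0.5, (1.0, 3.0)),
    (0.7, (1.0, 0.9)),
    (0.9, (1.0, 4.0)),
]
BASELINE = 0


def threshold_policy(t):
    """Return a policy that picks action 1 when x >= t, else the baseline."""
    return lambda x: 1 if x >= t else BASELINE


def evaluate(policy, estimates):
    """Return (mean estimated return, harm rate) for a policy."""
    total, harmed = 0.0, 0
    for x, outcomes in estimates:
        a = policy(x)
        total += outcomes[a]
        # Harmed: the chosen action is strictly worse than the baseline action.
        if outcomes[a] < outcomes[BASELINE]:
            harmed += 1
    n = len(estimates)
    return total / n, harmed / n


def select_threshold(estimates, thresholds, max_harm_rate):
    """Stage 2: best mean return among thresholds that meet the harm cap."""
    best_t, best_ret = None, float("-inf")
    for t in thresholds:
        ret, harm = evaluate(threshold_policy(t), estimates)
        if harm <= max_harm_rate and ret > best_ret:
            best_t, best_ret = t, ret
    return best_t


THRESHOLDS = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
unconstrained = select_threshold(ESTIMATES, THRESHOLDS, max_harm_rate=1.0)
constrained = select_threshold(ESTIMATES, THRESHOLDS, max_harm_rate=0.1)
print(unconstrained, constrained)  # prints: 0.4 0.8
```

On this toy data, the return-maximizing threshold (0.4) assigns action 1 to the individual at x = 0.7, who fares worse than under the baseline. Capping the harm rate at 0.1 moves the threshold to 0.8, which lowers the mean return from 1.98 to 1.6 but harms no one.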

The paper establishes finite-sample theoretical properties, including an upper bound on the sub-optimality gap of the learned policy and a guarantee that the harm rate stays controlled. Experiments on both simulated environments and real-world datasets show that the method reduces individual harm without sacrificing overall performance. The work bridges causal inference and safe RL, offering a principled way to deploy reinforcement learning in high-stakes domains where individual outcomes matter.

Key Points
  • Defines individual harm counterfactually: an action is harmful if it leads to a strictly worse outcome than an available alternative.
  • Proposes a two-stage procedure for learning policies that maximize expected return while controlling the harm rate.
  • Provides finite-sample guarantees, including a sub-optimality bound and harm-rate control, and validates the method on simulated and real-world data.

Why It Matters

Makes RL safer for healthcare, finance, and autonomous systems where individual outcomes cannot be ignored.