New RL method learns when to act, boosting communication efficiency 3.5x

arXiv cs.LG May 14, 2026

⚡A Lyapunov safety shield lets RL agents act only when needed, saving bandwidth

Deep Dive

The paper 'Learning When to Act' proposes a reinforcement learning (RL) framework that shifts the focus from what an agent should do to when it needs to act. The method pairs a communication-efficient policy, which jointly learns control inputs and timing decisions, with a pointwise Lyapunov safety shield. This run-time assurance (RTA) layer uses one-step-ahead Lyapunov predictions to decide when to override the policy with a precomputed LQR backup, giving stronger safety guarantees than constrained MDP methods, which enforce safety only in expectation. The agent can therefore act sparsely, only when necessary, while remaining stable around a known equilibrium.
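The override logic can be illustrated with a minimal sketch. It is not the paper's implementation: the double-integrator plant, the cost weights, the `decay` margin, and the names `dlqr` and `shield` are all assumptions chosen for illustration. The shield accepts a candidate input (a fresh policy action or a held previous input) only if the predicted next state does not increase a quadratic Lyapunov function; otherwise it falls back to the LQR input.

```python
import numpy as np

def dlqr(A, B, Q, R, iters=500):
    """Discrete-time LQR gain K and cost matrix P via Riccati fixed-point iteration."""
    P = Q.copy()
    for _ in range(iters):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return K, P

def shield(x, u_candidate, A, B, K, P, decay=0.0):
    """Return (u_applied, overridden) after a one-step-ahead Lyapunov check.

    V(x) = x^T P x is the assumed Lyapunov function; `decay` is a
    hypothetical required relative decrease per step.
    """
    def V(z):
        return float(z @ P @ z)
    x_next = A @ x + B @ u_candidate          # one-step-ahead prediction
    if V(x_next) <= (1.0 - decay) * V(x):
        return u_candidate, False             # candidate passes the check
    return -K @ x, True                       # fall back to the LQR backup

# Assumed toy plant: double integrator discretized with dt = 0.1 s.
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.5 * dt**2], [dt]])
K, P = dlqr(A, B, Q=np.eye(2), R=np.eye(1))

x = np.array([1.0, 0.0])
u_applied, overridden = shield(x, np.array([5.0]), A, B, K, P)
print("overridden:", overridden, "applied input:", u_applied)
```

In this toy case the candidate input pushes the state away from the origin, so the check fails and the backup input is applied. A learned policy would supply `u_candidate` at each step, together with its decision on whether to transmit a new input at all.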

On three benchmark tasks (an inverted pendulum, a cart-pole, and a planar quadrotor), the learned policy achieves mean inter-sample intervals (MSI) 1.91×, 1.45×, and 3.51× higher than a Lyapunov-triggered baseline. Critically, a fixed LQR controller operating at the same average rate is unstable in all three environments, indicating that adaptive timing, not merely a lower average rate, is what enables safe sparsity. The method also extends to a higher-dimensional 12-state 3D quadrotor, where it is robust to ±30% mass variation and disturbances. A preference-conditioned extension recovers the full tradeoff frontier from a single model at only 2/11 of the training compute, and experiments with SAC confirm that the results are algorithm-agnostic across discrete and continuous domains.
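To make the MSI ratios concrete, the sketch below assumes MSI is the average time between consecutive control updates. The timestamps are made-up illustrative values, not data from the paper, and `mean_inter_sample_interval` is a hypothetical helper name.

```python
def mean_inter_sample_interval(update_times):
    """Average gap between consecutive control-update timestamps (seconds)."""
    if len(update_times) < 2:
        raise ValueError("need at least two update times")
    gaps = [b - a for a, b in zip(update_times, update_times[1:])]
    return sum(gaps) / len(gaps)

# Illustrative timestamps only: times at which each controller sent a new input.
learned_times = [0.0, 0.3, 0.5, 1.1]
baseline_times = [0.0, 0.1, 0.3, 0.4, 0.6]

ratio = (mean_inter_sample_interval(learned_times)
         / mean_inter_sample_interval(baseline_times))
print(f"MSI ratio (learned / baseline): {ratio:.2f}x")
```

A ratio above 1 means the learned policy waits longer on average between updates, so it sends fewer control messages over the same horizon.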

Key Points

Achieves 1.91×, 1.45×, and 3.51× higher mean inter-sample interval on inverted pendulum, cart-pole, and quadrotor vs. Lyapunov-triggered baseline.
Fixed LQR controller at same average rate is unstable on all three plants; adaptive timing is essential for safe sparsity.
Preference-conditioned extension recovers full tradeoff frontier using only 2/11 of training compute.

Why It Matters

Reduces communication bandwidth for autonomous systems, enabling safer, more efficient real-time control with fewer actions.
