NOML-NOML: Custom RL algorithm for stable 6-DoF flight control

r/MachineLearning May 20, 2026

⚡ A new open-source RL algorithm solves pitch oscillation in continuous flight control.

Deep Dive

A Reddit user has open-sourced NOML (NOML-NOML), a custom reinforcement learning algorithm designed to solve a persistent problem in continuous flight control: oscillation collapse. When the developer trained a standard TD3 agent on a 6-DoF flight simulator (pitch, roll, yaw, throttle, brake, fire), the agent would consistently reach peak performance and then fall into pitch oscillations from which it never recovered. After ruling out reward shaping issues, the developer concluded that the root cause was structural and made three key modifications.

The first modification is an anchor policy: the action output becomes "anchor + delta*gate", where the anchor is a stable flight behavior (wings level, military throttle). Even a collapsed policy cannot fully forget how to fly straight, because with the learned term suppressed the action defaults to the anchor. Second, the actor is hierarchical: three separate MLPs, each with its own optimizer, handle pitch, roll, and the remaining actions. This keeps gradient updates for roll from corrupting the pitch head, which, according to the developer, eliminated the oscillation. Third, mirror learning exploits left-right symmetry to generate a free second sample from every transition, effectively doubling the training data, which matters most when environment steps are the bottleneck.

One surprising finding: exploration noise hurt performance, and the anchor-gate structure took over its role. The code (Apache 2.0), full write-up, and test video are on GitHub.
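The anchor composition described above can be sketched in a few lines of Python. This is only an illustration, not the project's code: the action layout, the anchor values, the per-channel gate in [0, 1], and the clipping range are all assumptions, and `compose_action` is a hypothetical name.

```python
from typing import List, Sequence

# Assumed action layout: [pitch, roll, yaw, throttle, brake, fire].
# The anchor values are hypothetical: neutral controls, high throttle.
ANCHOR: List[float] = [0.0, 0.0, 0.0, 0.8, 0.0, 0.0]


def compose_action(
    delta: Sequence[float],
    gate: Sequence[float],
    anchor: Sequence[float] = ANCHOR,
    low: float = -1.0,
    high: float = 1.0,
) -> List[float]:
    """Return anchor + delta * gate per channel, clipped to [low, high].

    `delta` stands in for the learned correction and `gate` for a learned
    per-channel weight (assumed here to lie in [0, 1]).
    """
    if not (len(delta) == len(gate) == len(anchor)):
        raise ValueError("delta, gate and anchor must have the same length")
    return [
        min(high, max(low, a + d * g))
        for a, d, g in zip(anchor, delta, gate)
    ]


if __name__ == "__main__":
    # A fully closed gate returns the anchor regardless of delta.
    print(compose_action([1.0] * 6, [0.0] * 6))
    # Opening only the pitch gate lets the policy adjust pitch alone.
    print(compose_action([0.5] * 6, [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]))
```

Under these assumptions, a gate that collapses to zero leaves the aircraft on the anchor behavior, which is the fallback property the write-up attributes to this structure.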

Key Points

Anchor policy ensures safe fallback: action = anchor + delta*gate, preventing complete policy collapse.
Hierarchical actor uses three separate MLPs, each with its own optimizer (pitch → roll → rest), so gradient updates for one head cannot corrupt another.
Mirror learning exploits left-right symmetry to produce a second training sample from every transition, doubling the data gathered per environment step.
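The mirror-learning point can be illustrated with a small sketch of transition augmentation. The observation layout, the choice of which components change sign under a left-right reflection, and the names `Transition`, `mirror`, and `augment` are all assumptions made for illustration. The sketch also assumes the reward is unchanged by the reflection, which holds only if the task itself is symmetric.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical layouts, for illustration only:
#   state  = (pitch_angle, roll_angle, yaw_rate, altitude)
#   action = (pitch, roll, yaw, throttle, brake, fire)
# Lateral quantities (roll, yaw) change sign under a left-right reflection.
STATE_FLIP = frozenset({1, 2})
ACTION_FLIP = frozenset({1, 2})


@dataclass(frozen=True)
class Transition:
    state: Tuple[float, ...]
    action: Tuple[float, ...]
    reward: float
    next_state: Tuple[float, ...]
    done: bool


def _flip(values: Tuple[float, ...], indices: frozenset) -> Tuple[float, ...]:
    """Negate the components whose index is in `indices`."""
    return tuple(-v if i in indices else v for i, v in enumerate(values))


def mirror(t: Transition) -> Transition:
    """Return the left-right reflection of a transition."""
    return Transition(
        state=_flip(t.state, STATE_FLIP),
        action=_flip(t.action, ACTION_FLIP),
        reward=t.reward,  # assumed invariant under the reflection
        next_state=_flip(t.next_state, STATE_FLIP),
        done=t.done,
    )


def augment(t: Transition) -> List[Transition]:
    """Return the original transition plus its mirror image."""
    return [t, mirror(t)]
```

In a replay-buffer setup, both elements of `augment(t)` would be stored after each environment step, so the buffer grows by two samples per step without extra simulation.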

Why It Matters

This structural approach could enable stable RL for real-world drone and aircraft control with less tuning.
