ERFSL uses LLMs to auto-tune reward functions in about five iterations
New method corrects reward code in one feedback iteration per requirement, handles weights off by a factor of 500
A team of researchers from multiple institutions has introduced ERFSL (Efficient Reward Function Searcher via Language Models), a system that leverages large language models (LLMs) to automate the tedious process of designing reward functions for reinforcement learning in custom environments. The approach addresses multi-objective optimization, where agents must balance competing goals like speed, safety, and energy efficiency.
ERFSL works in three steps. First, an LLM (tested with GPT-4o mini) generates reward function components from explicit user requirements. Next, a reward critic reviews and corrects the code, requiring only one feedback iteration per requirement. Finally, a weight optimizer iteratively adjusts the importance of each reward component using textual logs from training. In simulation benchmarks, even when a weight was off by a factor of 500, the system needed just 5.2 iterations on average to satisfy user requirements. This makes reward design accessible without deep reinforcement learning expertise, which could speed up the development of AI controllers for robotics, autonomous vehicles, and industrial systems.
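To make the idea of "reward components" and "weights" concrete, here is a minimal Python sketch of a multi-objective reward built as a weighted sum. It assumes a weighted-sum form and three components named after the goals mentioned above (speed, safety, energy); the function names, state fields, and weight values are hypothetical illustrations, not code from ERFSL.

```python
from typing import Callable, Dict

State = Dict[str, float]

# Hypothetical reward components, one per user requirement.
def speed_reward(state: State) -> float:
    """Reward forward progress."""
    return state["velocity"]

def safety_penalty(state: State) -> float:
    """Penalize each collision in the current step."""
    return -state["collisions"]

def energy_penalty(state: State) -> float:
    """Penalize energy spent in the current step."""
    return -state["energy_used"]

COMPONENTS: Dict[str, Callable[[State], float]] = {
    "speed": speed_reward,
    "safety": safety_penalty,
    "energy": energy_penalty,
}

def total_reward(state: State, weights: Dict[str, float]) -> float:
    """Combine the components into one scalar as a weighted sum (an assumed form)."""
    return sum(weights[name] * fn(state) for name, fn in COMPONENTS.items())

if __name__ == "__main__":
    # Illustrative weights; choosing these values is the part a weight optimizer would tune.
    weights = {"speed": 1.0, "safety": 10.0, "energy": 0.1}
    state = {"velocity": 2.0, "collisions": 1.0, "energy_used": 0.5}
    print(total_reward(state, weights))
```

In this framing, the first two steps produce and debug the component functions, while the third step only changes the numbers in `weights`.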
- ERFSL uses GPT-4o mini to generate reward components from user requirements, needing only one feedback iteration for correction per requirement
- Weight optimizer converges in 5.2 iterations on average, even when an initial weight is off by a factor of 500
- Designed for custom-environment multi-objective learning, tested on simulation-based benchmarks for robotics and control tasks
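The train, read the log, adjust the weights loop can be sketched in a few lines of Python. This is a toy under loud assumptions: `mock_train` is a made-up stand-in for an RL training run, and `propose_weights` is a fixed rule standing in for the LLM that would read the textual log. None of the names, the tenfold step, or the numbers come from ERFSL, and the iteration count it produces says nothing about the reported 5.2 average.

```python
from typing import Dict, Tuple

Weights = Dict[str, float]

def mock_train(weights: Weights) -> Dict[str, float]:
    """Made-up stand-in for training: collisions drop as safety outweighs speed."""
    ratio = weights["safety"] / weights["speed"]
    return {"collisions_per_episode": 10.0 / (1.0 + ratio)}

def format_log(metrics: Dict[str, float]) -> str:
    """Render metrics as the kind of short text log an optimizer could read."""
    return f"collisions_per_episode={metrics['collisions_per_episode']:.2f}"

def parse_metric(log: str) -> float:
    """Extract the numeric value from the single-metric log line."""
    return float(log.split("=")[1])

def propose_weights(weights: Weights, log: str) -> Weights:
    """Rule-based stand-in for the LLM: scale the safety weight up tenfold.

    The log is accepted but unused here; an LLM would condition on its content.
    """
    updated = dict(weights)
    updated["safety"] *= 10.0
    return updated

def search_weights(initial: Weights, target: float, max_updates: int = 20) -> Tuple[Weights, int]:
    """Repeat train -> log -> adjust until the requirement in the log is met."""
    weights = dict(initial)
    for updates in range(max_updates + 1):
        log = format_log(mock_train(weights))
        if parse_metric(log) <= target:
            return weights, updates
        weights = propose_weights(weights, log)
    raise RuntimeError("no satisfying weights found within max_updates")

if __name__ == "__main__":
    # Start with a safety weight that is far too small for the requirement.
    found, updates = search_weights({"speed": 1.0, "safety": 0.002}, target=1.0)
    print(found, updates)
```

The point of the sketch is the shape of the loop: each round costs one training run, so a method that needs only a few rounds to recover from a badly scaled weight saves most of the tuning effort.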
Why It Matters
Automates complex reward engineering for multi-objective AI systems, replacing manual trial and error with a handful of LLM-guided iterations.