Agent Frameworks

Memory-R2 framework fixes unfair credit assignment in long-horizon LLM agents

The new LoGo-GRPO algorithm compares memory operations fairly across rollouts whose memories have diverged, with training horizons of up to 32 sessions.

Deep Dive

Memory-augmented LLM agents can store, update, and reuse information across sessions, but training them with reinforcement learning is tricky: once different rollouts write different memories, they no longer share the same environment. This violates the key assumption behind group-relative methods like GRPO, where rollouts are compared as if sampled from the same state, leading to noisy or biased credit signals for memory operations.
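As a minimal sketch of the group-relative idea: GRPO-style methods score each rollout against the mean and spread of its own group, which only makes sense if all rollouts started from the same state. The function name, the toy rewards, and the choice of population standard deviation are illustrative assumptions, not details from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Score each rollout against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    # eps avoids division by zero when every rollout earns the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four rollouts assumed to be sampled from the same starting state (toy rewards).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

If the rollouts have already written different memories, the group mean is no longer a fair baseline for any single rollout, which is the problem described above.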

To solve this, researchers from LMU Munich, Siemens, and other institutions introduce Memory-R2, built around a new core algorithm called LoGo-GRPO. LoGo-GRPO combines local group-relative optimization, which rerolls different memory operations from the same intermediate memory state so that comparisons are fair, with global trajectory-level rewards that preserve end-to-end learning. The framework also uses a shared-parameter co-learning design: a fact extractor and a memory manager are instantiated from the same LLM backbone using role-specific prompts, so memory formation and memory evolution are optimized jointly. To stabilize multi-step RL over long horizons, Memory-R2 employs a progressive curriculum that gradually lengthens training from 8 to 16 to 32 sessions.
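The local-plus-global idea can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: the memory is a plain list of strings, and `apply_op`, `score`, `NEEDED`, and the weighted sum controlled by `local_weight` are all hypothetical. The summary does not say how the local and global signals are actually combined.

```python
import copy
from statistics import mean, pstdev

def local_advantages(memory, candidate_ops, apply_op, score, eps=1e-8):
    """Reroll each candidate operation from a copy of the SAME memory state."""
    rewards = []
    for op in candidate_ops:
        branch = copy.deepcopy(memory)  # every candidate starts from an identical state
        apply_op(branch, op)
        rewards.append(score(branch))
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def combine(local_adv, global_advantage, local_weight=0.5):
    """Hypothetical blend of local and trajectory-level signals (weighted sum)."""
    return [local_weight * a + (1 - local_weight) * global_advantage
            for a in local_adv]

# --- toy environment, purely illustrative ---
NEEDED = {"user likes tea", "user lives in Munich"}

def apply_op(memory, op):
    kind, payload = op
    if kind == "add":
        memory.append(payload)
    elif kind == "delete":
        memory.pop(payload)
    # "noop" leaves the memory unchanged

def score(memory):
    # Reward = number of needed facts still present in memory.
    return float(len(NEEDED & set(memory)))

memory = ["user likes tea"]
ops = [("add", "user lives in Munich"), ("noop", None), ("delete", 0)]

local = local_advantages(memory, ops, apply_op, score)
combined = combine(local, global_advantage=0.3)
```

Because each candidate is applied to a deep copy, the original `memory` is untouched and the three operations are compared from the same starting point.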

Key Points
  • LoGo-GRPO algorithm combines local rerollouts (fair comparisons from identical memory states) with global trajectory rewards for precise credit assignment.
  • Shared-parameter co-learning lets the fact extractor and the memory manager share one LLM backbone via role-specific prompts, jointly optimizing memory formation and evolution.
  • Progressive curriculum scales training horizon incrementally from 8 to 16 to 32 sessions, stabilizing multi-step reinforcement learning over long memory spans.
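
A progressive horizon schedule like the one in the last point could look roughly like this. The 8, 16, 32 stages come from the summary; splitting the training steps evenly across the stages, and the names `session_horizon`, `step`, and `total_steps`, are assumptions made for illustration.

```python
def session_horizon(step, total_steps, stages=(8, 16, 32)):
    """Return how many sessions each training episode spans at a given step."""
    # Assumption: each stage gets an equal share of the training steps.
    stage = min(step * len(stages) // total_steps, len(stages) - 1)
    return stages[stage]

# Horizon at the start, middle, and end of a hypothetical 300-step run.
schedule = [session_horizon(s, 300) for s in (0, 150, 299)]
```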

Why It Matters

Memory-R2 targets more reliable training of LLM agents that persist memory across dozens of sessions, a capability that matters for real-world assistants and autonomous workflows.