DIM-WAM boosts robot task success from 28% to 70% with memory augmentation
New memory-augmented world-action model achieves 91.5% stage success on real robots
World-action models jointly predict future visual states and actions, but existing methods struggle with long-horizon tasks that depend on earlier observations and task progress. Researchers introduce DIM-WAM, a memory-augmented approach that works in three steps: it extracts compact visual event information from real observations, updates multiple memory banks through independent similarity-based merging, and reads long-term context tagged with bank-identity and time embeddings to condition video and action denoising. A progress-supervision objective further pushes memory tokens to encode not only completed historical events but also the current task stage and its implications.
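To make the memory-update step concrete, here is a minimal sketch of similarity-based merging across independent banks. It is an illustration only, not DIM-WAM's implementation: the names `MemoryBank`, `Slot`, and `read_context`, the fixed capacities, the choice of cosine similarity over adjacent slots, and the count-weighted averaging are all assumptions. Plain `(bank_id, time)` tags stand in for the learned bank-identity and time embeddings.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class Slot:
    vec: list       # (possibly merged) event feature
    time: float     # mean timestep of the observations merged into this slot
    count: int = 1  # number of observations merged into this slot


@dataclass
class MemoryBank:
    """Hypothetical fixed-capacity bank (capacity >= 1); not the paper's design."""
    bank_id: int
    capacity: int
    slots: list = field(default_factory=list)

    def write(self, vec, t):
        # Append the new event, then merge if the bank is over capacity.
        self.slots.append(Slot(list(vec), float(t)))
        if len(self.slots) > self.capacity:
            self._merge_most_similar()

    def _merge_most_similar(self):
        # Assumed rule: merge the adjacent pair with the highest cosine similarity.
        i = max(
            range(len(self.slots) - 1),
            key=lambda k: cosine(self.slots[k].vec, self.slots[k + 1].vec),
        )
        a, b = self.slots[i], self.slots[i + 1]
        n = a.count + b.count
        merged = Slot(
            [(a.count * x + b.count * y) / n for x, y in zip(a.vec, b.vec)],
            (a.count * a.time + b.count * b.time) / n,
            n,
        )
        self.slots[i:i + 2] = [merged]


def read_context(banks):
    """Flatten all banks into (bank_id, time, vec) tuples for conditioning."""
    return [(b.bank_id, s.time, s.vec) for b in banks for s in b.slots]


# Two banks with different capacities, each updated independently.
banks = [MemoryBank(bank_id=0, capacity=2), MemoryBank(bank_id=1, capacity=4)]
events = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.0]]
for t, ev in enumerate(events):
    for bank in banks:
        bank.write(ev, t)

context = read_context(banks)
```

In this toy setup the smaller bank is forced to merge more aggressively and keeps a coarser summary of the episode, while the larger bank retains finer-grained events. Every observation is still accounted for in each bank through the slot counts.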
On the RMBench benchmark, DIM-WAM achieved 69.8% average success, more than double the 28.4% of the LingBot-VA baseline and well ahead of the explicit-memory Mem-0 baseline at 42.0%. In real-world experiments on four Franka robot tasks, stage success improved from 70.7% to 91.5% and full-task success from 52.5% to 80.0%. These results suggest that retaining and using multi-scale historical context is key to enabling robots to handle complex, sequentially dependent manipulation tasks.
- Achieves 69.8% average success on RMBench, versus 28.4% for the LingBot-VA baseline and 42.0% for the explicit-memory Mem-0 baseline
- Real-world Franka tasks: 91.5% stage success and 80.0% full-task success, versus 70.7% and 52.5% for the baseline
- Uses multi-scale memory banks with similarity-based merging and a progress-supervision objective to track task stage
Why It Matters
Enables robots to remember long-horizon task context, which is critical for complex real-world manipulation tasks where later steps depend on earlier observations.