Hybrid Self-evolving Structured Memory for GUI Agents
A new graph-based memory system gives 7B-parameter models a +22.5% performance boost, letting them surpass giants like Gemini 2.5 Pro.
A team of researchers including Sibo Zhu, Wenyi Wu, and Kun Zhou has introduced a breakthrough memory architecture for AI agents that operate computer interfaces. Their paper, "Hybrid Self-evolving Structured Memory for GUI Agents," proposes HyMEM—a graph-based system designed to overcome the limitations of current agents on real-world computer tasks, which are notoriously difficult due to long sequences of steps, diverse software interfaces, and frequent errors. Prior methods relied on flat, static memory retrieval; HyMEM instead mimics the brain's structure by combining discrete, high-level symbolic nodes with continuous embeddings of past action sequences (trajectories). This hybrid graph supports complex, multi-step reasoning and updates itself dynamically as the agent acts.
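To make the idea concrete, here is a minimal sketch of what such a hybrid memory graph could look like: each node pairs a discrete symbolic label (e.g. a GUI action) with a continuous embedding of the trajectory that produced it, and retrieval ranks nodes by embedding similarity while returning symbolic results. All class and method names here are illustrative assumptions, not the paper's actual API.

```python
import math

def cosine(a, b):
    # standard cosine similarity between two vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class MemoryNode:
    def __init__(self, label, embedding):
        self.label = label          # symbolic part, e.g. "open_settings"
        self.embedding = embedding  # continuous trajectory embedding
        self.edges = []             # links to related experiences

class HybridMemoryGraph:
    def __init__(self):
        self.nodes = []

    def add(self, label, embedding):
        # "self-evolving" in miniature: each new experience is linked to its
        # nearest existing neighbor, so the graph grows and reorganizes online
        node = MemoryNode(label, embedding)
        if self.nodes:
            nearest = max(self.nodes,
                          key=lambda n: cosine(n.embedding, embedding))
            node.edges.append(nearest)
            nearest.edges.append(node)
        self.nodes.append(node)
        return node

    def retrieve(self, query_embedding, k=2):
        # hybrid retrieval: rank by continuous similarity,
        # return the discrete symbolic labels for reasoning
        ranked = sorted(self.nodes,
                        key=lambda n: cosine(n.embedding, query_embedding),
                        reverse=True)
        return [n.label for n in ranked[:k]]

mem = HybridMemoryGraph()
mem.add("open_settings", [1.0, 0.0, 0.0])
mem.add("click_save",    [0.0, 1.0, 0.0])
mem.add("open_file",     [0.9, 0.1, 0.0])
print(mem.retrieve([1.0, 0.05, 0.0], k=2))  # → ['open_settings', 'open_file']
```

A sketch like this shows why the hybrid design helps: the continuous side handles fuzzy matching against new situations, while the symbolic side keeps retrieved memories interpretable enough to plug into multi-step plans.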
Extensive testing shows HyMEM's impact is substantial. When integrated, it consistently elevates the performance of smaller, open-source vision-language models (VLMs). Most notably, it boosted the capabilities of the Qwen2.5-VL-7B model by 22.5%. This enhancement was enough for the 7-billion-parameter model to match and even surpass the performance of much larger, closed-source models like Gemini 2.5 Pro Vision and GPT-4o on GUI agent benchmarks. The architecture's key innovations—self-evolving nodes and on-the-fly working memory refresh—allow agents to better organize past experiences and adapt to new situations during task execution.
The research signifies a major step toward more efficient and capable autonomous AI assistants. By providing a superior memory framework, it reduces the reliance on simply scaling model size. This enables more accessible, smaller models to perform complex digital tasks—like data entry, software navigation, and workflow automation—at a level competitive with the most advanced proprietary systems, potentially democratizing powerful GUI automation technology.
- HyMEM uses a graph structure coupling symbolic nodes with continuous embeddings for superior memory organization and retrieval.
- It boosted the Qwen2.5-VL-7B model's performance by 22.5%, enabling it to outperform GPT-4o and Gemini 2.5 Pro Vision.
- The system self-evolves and refreshes working memory dynamically, allowing agents to handle long, complex computer workflows.
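The dynamic working-memory refresh mentioned above can be sketched as a simple relevance filter: before each action, keep only the stored experiences most relevant to the current step, so long workflows do not flood the agent's context. The scoring function and capacity below are assumptions for illustration; the paper's actual mechanism is not specified here.

```python
def refresh_working_memory(entries, current_step_embedding, capacity=3):
    """Keep the `capacity` most relevant (label, embedding) pairs.

    entries: list of (label, embedding) pairs from long-term memory.
    A dot product stands in for the paper's (unspecified) relevance score.
    """
    def relevance(entry):
        _, emb = entry
        return sum(x * y for x, y in zip(emb, current_step_embedding))

    # retain only the top-scoring experiences for the next action
    return sorted(entries, key=relevance, reverse=True)[:capacity]

past = [
    ("open_settings", [1.0, 0.0]),
    ("click_save",    [0.0, 1.0]),
    ("open_file",     [0.8, 0.2]),
    ("scroll_down",   [0.1, 0.1]),
]
active = refresh_working_memory(past, [1.0, 0.0], capacity=2)
print([label for label, _ in active])  # → ['open_settings', 'open_file']
```

Refreshing on the fly like this is what lets a small model stay focused across long, error-prone workflows instead of dragging its entire history through every step.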
Why It Matters
Enables smaller, open-source AI models to automate complex computer tasks at a level rivaling top closed-source models, reducing cost and accessibility barriers.