The Yokai Learning Environment: Tracking Beliefs Over Space and Time
New benchmark reveals that top AI coordination methods fail when agents must track beliefs about moving cards while paired with unseen partners.
A team of researchers including Constantin Ruhdorfer, Matteo Bortoletto, and Jakob Foerster has introduced the Yokai Learning Environment (YLE), a new open-source benchmark designed to rigorously test an AI agent's ability to cooperate with unknown partners, a problem known as zero-shot coordination (ZSC). The YLE was created because the field's previous dominant benchmark, the Hanabi Learning Environment (HLE), has been largely solved: top methods achieve near-perfect scores there, limiting its ability to track further algorithmic progress.
The YLE is substantially harder because it requires agents to build 'common ground' through three capabilities absent from Hanabi: tracking and updating beliefs about cards that change position, reasoning under intentionally ambiguous hints, and deciding when to end the game based on inferred shared knowledge. In Hanabi, beliefs attach to static hand slots and hints are truthful by rule; the YLE's dynamic, ambiguous environment forces agents to maintain robust internal models of their partners' beliefs.
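To make the difference concrete, here is a minimal sketch of the belief-tracking problem the YLE introduces: beliefs must follow cards as they move, and ambiguous hints narrow a belief rather than resolve it. The representation (a dictionary mapping board positions to color distributions) and all function names are illustrative assumptions, not the YLE's actual API.

```python
# Sketch: belief tracking when cards move and hints are ambiguous.
# Positions, colors, and helpers are hypothetical, for illustration only.

COLORS = ["red", "blue", "green", "yellow"]

def uniform_belief():
    """Initial belief over a face-down card's color."""
    return {c: 1.0 / len(COLORS) for c in COLORS}

def apply_hint(belief, hinted_colors):
    """Restrict a belief to the colors consistent with a hint and
    renormalize; a hint covering several colors narrows, not resolves."""
    filtered = {c: (p if c in hinted_colors else 0.0) for c, p in belief.items()}
    total = sum(filtered.values())
    return {c: p / total for c, p in filtered.items()}

def move_card(beliefs, src, dst):
    """When a face-down card is relocated, its belief must move with it,
    unlike Hanabi, where beliefs stay attached to fixed hand slots."""
    beliefs[dst] = beliefs.pop(src)

beliefs = {(0, 0): uniform_belief()}
beliefs[(0, 0)] = apply_hint(beliefs[(0, 0)], {"red", "blue"})  # ambiguous hint
move_card(beliefs, (0, 0), (2, 1))  # card relocated; belief travels with it
print(beliefs[(2, 1)])  # {'red': 0.5, 'blue': 0.5, 'green': 0.0, 'yellow': 0.0}
```

The key design point is that the belief is keyed by the card's current location rather than a fixed slot, so an agent that fails to propagate beliefs through moves ends up reasoning about the wrong card.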
When the researchers evaluated state-of-the-art ZSC algorithms, including High-Entropy IPPO, Other-Play, and Off-Belief Learning, they found that these methods, despite excelling in the HLE, exhibit persistent gaps between self-play and cross-play performance in the YLE. When paired with unseen partners, agents showed degraded calibration for ending games early and weaker belief representations, indicating that they fail to maintain consistent internal models. Crucially, the methods that perform best on the HLE are not the ones that perform best on the YLE, demonstrating that progress on a single benchmark does not guarantee generalizable cooperative intelligence.
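The self-play vs. cross-play gap mentioned above is a standard ZSC diagnostic: independently trained agents are paired with each other, and scores on the diagonal of the resulting matrix (agents paired with themselves) are compared to off-diagonal scores (agents paired with strangers). The following sketch shows the metric under that standard definition; the function name and the toy numbers are hypothetical, not results from the paper.

```python
# Sketch of the self-play vs. cross-play gap metric used in ZSC evaluation.
# score[(i, j)] = mean return when training seed i is paired with seed j.
# The toy matrix below is made up purely to illustrate the computation.
import itertools
import statistics

def crossplay_gap(score):
    """Return (self-play mean, cross-play mean, gap) for a score matrix."""
    seeds = sorted({i for i, _ in score})
    self_play = [score[(i, i)] for i in seeds]              # diagonal entries
    cross_play = [score[(i, j)]                             # off-diagonal entries
                  for i, j in itertools.permutations(seeds, 2)]
    sp, xp = statistics.mean(self_play), statistics.mean(cross_play)
    return sp, xp, sp - xp

scores = {(0, 0): 24.0, (1, 1): 23.5, (2, 2): 24.2,
          (0, 1): 15.1, (1, 0): 14.8, (0, 2): 16.0,
          (2, 0): 15.5, (1, 2): 14.9, (2, 1): 15.3}
sp, xp, gap = crossplay_gap(scores)
print(f"self-play {sp:.2f}, cross-play {xp:.2f}, gap {gap:.2f}")
```

A method that has genuinely learned conventions robust to unfamiliar partners should drive this gap toward zero; the paper's finding is that on the YLE the gap stays large even for methods that close it on the HLE.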
- The Yokai Learning Environment (YLE) is a new open-source benchmark that tests AI's ability to cooperate with unknown partners through belief tracking, ambiguous hints, and shared knowledge inference.
- Leading zero-shot coordination methods like High-Entropy IPPO and Off-Belief Learning, which achieve near-perfect scores on the older Hanabi benchmark, show significant performance drops and belief inconsistencies in YLE.
- The results show that benchmarking on a single environment like Hanabi is insufficient, as top-performing methods there fail to generalize, establishing the YLE as a necessary, more challenging standard for cooperative AI progress.
Why It Matters
This exposes a critical flaw in current cooperative AI: models that excel in simplified tests fail at the kind of teamwork that requires dynamically sharing and updating beliefs with unfamiliar partners.