AI world models learn spatial semantics from physical exploration, no language needed
A new paper shows world models build geometric understanding of space just by moving around
A new arXiv paper (2605.28865) by Jiayi Fang demonstrates that world models can learn semantic representations of space purely through physical interaction, without any language supervision. When a VAE-based world model was trained on random embodied exploration, its latent space developed organized spatial semantics that mirror the geometric structure of the real world. Direction accuracy reached 0.677 ± 0.029 versus 0.547 for a randomly initialized encoder, a statistically significant improvement. The position score from representational similarity analysis (RSA) rose to 0.192 ± 0.047, a 6.6x improvement over random encoders, indicating that training induces genuine structural organization beyond the inductive bias of CNNs.
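To make the position RSA number more concrete, here is a minimal sketch of one common way such a score is computed: take the pairwise distances between physical positions, take the pairwise distances between the matching latent vectors, and rank-correlate the two lists. The function names, the Euclidean distance, and the Spearman-style rank correlation are assumptions for illustration; the paper's exact RSA procedure is not described in this summary.

```python
import math
from itertools import combinations


def pairwise_distances(points):
    """Euclidean distance for every unordered pair of points, in a fixed order."""
    return [math.dist(a, b) for a, b in combinations(points, 2)]


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def position_rsa(positions, latents):
    """Rank correlation between position distances and latent distances (assumed RSA form)."""
    return pearson(
        average_ranks(pairwise_distances(positions)),
        average_ranks(pairwise_distances(latents)),
    )


# Toy data: latents that are a scaled copy of the positions preserve all distance ranks.
positions = [(0, 0), (1, 0), (0, 2), (5, 3)]
latents = [(2 * x, 2 * y, 7.0) for x, y in positions]
print(position_rsa(positions, latents))
```

A score near 1 means the latent space preserves the ordering of physical distances; a score near 0 means latent distances carry little information about where the agent was.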
The study also uncovered a critical role for KL regularization: with the standard setting of beta=0.1, the encoder was forced away from geometric structure, and both prediction performance and semantic alignment collapsed to near chance by step 50,000. Reducing beta to 0.001 restored geometric access and recovered both capabilities together. This supports the "shared-driver" account, in which prediction and semantic alignment improve together (Spearman r=-0.61, p=0.004). The findings establish physical world geometry as the natural organizing principle for world model representations, with direct implications for designing embodied agents that can ground meaning without relying on language labels.
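The role of beta is easier to see in a sketch of a beta-weighted VAE objective. The form below (a prediction error plus beta times the KL divergence between a diagonal Gaussian posterior and a standard normal prior) is the textbook beta-VAE loss and is assumed here; the paper's exact loss, and the names `gaussian_kl` and `beta_vae_loss`, are not taken from the source.

```python
import math


def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))


def beta_vae_loss(prediction_error, mu, logvar, beta):
    """Assumed objective: prediction error plus a beta-weighted KL penalty."""
    return prediction_error + beta * gaussian_kl(mu, logvar)


# Hypothetical posterior for one observation: means far from zero, low variance.
# An encoder needs latents like this to keep distinct places distinguishable.
mu = [1.5, -2.0]
logvar = [-1.0, -1.0]

kl = gaussian_kl(mu, logvar)
for beta in (0.1, 0.001):
    print(f"beta={beta}: KL penalty contributes {beta * kl:.4f} to the loss")
```

For the same latent code, the penalty at beta=0.1 is 100 times larger than at beta=0.001, so a larger beta pushes the encoder harder toward the uninformative prior. That is in line with the collapse reported above, though this toy calculation does not reproduce the paper's experiment.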
- Direction accuracy of 0.677 vs 0.547 for a randomly initialized encoder, and position RSA 6.6x higher, after training on random physical exploration
- Standard KL regularization (beta=0.1) killed both prediction and semantic alignment; lowering beta to 0.001 restored both
- 20 temporal checkpoints show prediction performance and semantic structure co-improve (Spearman r=-0.61, p=0.004)
Why It Matters
The results suggest that AI agents can learn spatial semantics through movement alone, reducing their dependence on labeled data.