Probing the Latent World: Emergent Discrete Symbols and Physical Structure in Latent Representations
A new framework passively converts an AI model's internal video representations into a discrete, interpretable symbolic language.
Independent researcher Hung Ming Liu has introduced the AI Mother Tongue (AIM) framework, a novel method for interpreting the internal representations of advanced video AI models. The framework acts as a passive quantization probe for Meta's V-JEPA 2, a Joint Embedding Predictive Architecture (JEPA) model trained to predict masked video regions in a compressed latent space. Unlike the outputs of generative models, V-JEPA 2's latent representations are not directly inspectable, creating an 'interpretability gap.' AIM addresses this by converting the model's continuous latent vectors into discrete symbol sequences without any task-specific supervision and, crucially, without modifying or fine-tuning the frozen V-JEPA 2 encoder. This ensures that any discovered symbolic structure is attributable solely to the pre-trained model's learned representations.
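The core idea of a passive quantization probe can be illustrated with a minimal sketch. The article does not specify AIM's actual quantization scheme, so the snippet below approximates the concept with a simple k-means codebook fit on frozen latent vectors; all function names and parameters here are hypothetical, not AIM's API. The key property it demonstrates is passivity: the encoder's latents are only read, never updated.

```python
import numpy as np

# Illustrative sketch only -- not AIM's actual algorithm. We stand in a
# k-means codebook for whatever quantizer AIM uses; the frozen encoder's
# latents are treated as read-only inputs throughout.

def fit_codebook(latents, k=16, iters=50, seed=0):
    """Learn k codewords from frozen latent vectors via plain k-means."""
    rng = np.random.default_rng(seed)
    codebook = latents[rng.choice(len(latents), size=k, replace=False)]
    for _ in range(iters):
        # Assign each latent to its nearest codeword (Euclidean distance).
        d = np.linalg.norm(latents[:, None, :] - codebook[None, :, :], axis=-1)
        assign = d.argmin(axis=1)
        # Move each codeword to the mean of its assigned latents.
        for j in range(k):
            members = latents[assign == j]
            if len(members):
                codebook[j] = members.mean(axis=0)
    return codebook

def quantize(latents, codebook):
    """Map continuous latents to discrete symbol ids. This is the passive
    read-out: no gradients flow and the encoder is never modified."""
    d = np.linalg.norm(latents[:, None, :] - codebook[None, :, :], axis=-1)
    return d.argmin(axis=1)

# Toy usage: pretend these are per-clip latents from a frozen encoder.
latents = np.random.default_rng(1).normal(size=(200, 8))
codebook = fit_codebook(latents, k=16)
symbols = quantize(latents, codebook)  # discrete symbol sequence, ids in [0, 16)
```

Because the codebook is fit entirely outside the model, any structure found in the resulting symbol sequences reflects the pre-trained representation, mirroring the attribution argument above.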
In rigorous testing on the Kinetics-mini video dataset, the AIM probe revealed that V-JEPA 2's latent space has learned a surprisingly structured and compact representation of the physical world. The probe's symbol distributions showed statistically significant differences (χ² test, p < 10⁻⁴) across three physical dimensions: grasp angle, object geometry, and motion temporal structure. The results indicate that diverse action categories share a common representational core, with semantic differences encoded as subtle, graded variations in symbol distributions rather than hard categorical boundaries. This work establishes the first stage of a four-stage roadmap, demonstrating that structured symbolic manifolds are a discoverable property of frozen JEPA models, paving the way for more interpretable and actionable AI world models.
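The significance test referenced above is a standard chi-squared test of independence between a category (e.g., action class or grasp angle) and the emitted symbol. A minimal self-contained version is sketched below; the contingency table is made-up toy data for illustration, not counts from the study.

```python
# Hypothetical sketch of the kind of test reported: does symbol usage
# differ across categories? Toy counts only, not the paper's data.

def chi_square_stat(table):
    """Pearson chi-squared statistic for a contingency table
    (rows = categories, cols = discrete symbols)."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    total = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_tot[i] * col_tot[j] / total
            stat += (obs - expected) ** 2 / expected
    return stat

# Two categories with shifted symbol usage (graded, not categorical).
table = [[40, 30, 30],
         [20, 35, 45]]
stat = chi_square_stat(table)
# Degrees of freedom = (rows-1)*(cols-1) = 2; a statistic well above the
# 5.99 critical value rejects independence at the p < 0.05 level.
```

Note how both toy categories use all three symbols, just with shifted frequencies; this is the "graded distributional variation" pattern the article describes, as opposed to disjoint symbol sets per category.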
- The AI Mother Tongue (AIM) framework is a passive, vocabulary-free probe that converts Meta's V-JEPA 2 latent vectors into discrete symbols without modifying the model.
- Experiments on Kinetics-mini showed the latent space encodes physical concepts like grasp angle and motion structure with high statistical significance (χ² test, p < 10⁻⁴).
- The research reveals V-JEPA 2's latent space is compact, using graded distributional variations to encode semantics, and is Stage 1 of a roadmap to a symbolic world model.
Why It Matters
This work provides a crucial tool for peering inside 'black box' AI video models, a major step toward building interpretable, reliable, and actionable AI agents.