MP-STRUCT pre-training slashes LLM training data needs by imitating human language acquisition
Just 500 steps of pre-pretraining on a synthetic language match strong baselines in token efficiency.
Large Language Models (LLMs) remain far less data-efficient than humans, requiring massive corpora to learn language. Inspired by the Language Acquisition Device (LAD) hypothesis — which posits that innate constraints guide human language learning — researchers Masato Mita, Taiga Someya, Ryo Yoshida, and Yohei Oseki propose pre-pretraining (PPT) on a novel synthetic language called MP-STRUCT. Unlike prior work focusing on highly expressive formal languages like k-Shuffle Dyck, MP-STRUCT encodes three core linguistic operations: MERGE (hierarchical composition), AGREE (feature-based dependencies), and MOVE (long-distance displacement). This design mimics the innate structural biases that allow humans to learn language from limited data.
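To make the three operations concrete, here is a minimal Python sketch of how a toy string could be built from them. It is an illustration only, not the authors' MP-STRUCT generator: the token names (`N`, `V`, `WH`), the `.pl` feature suffix, the `_` gap marker, and all function names are hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Leaf:
    token: str


@dataclass
class Node:
    left: "Tree"
    right: "Tree"


Tree = Union[Leaf, Node]


def merge(left: Tree, right: Tree) -> Node:
    """MERGE: combine two constituents into one binary-branching constituent."""
    return Node(left, right)


def agree(controller: str, target: str, feature: str) -> Tuple[Leaf, Leaf]:
    """AGREE: stamp the same feature value on both members of a dependency."""
    return Leaf(f"{controller}.{feature}"), Leaf(f"{target}.{feature}")


def flatten(tree: Tree) -> List[str]:
    """Read the leaves of the tree from left to right."""
    if isinstance(tree, Leaf):
        return [tree.token]
    return flatten(tree.left) + flatten(tree.right)


def move(tokens: List[str], index: int, gap: str = "_") -> List[str]:
    """MOVE: displace one token to the front, leaving a gap marker behind."""
    return [tokens[index]] + tokens[:index] + [gap] + tokens[index + 1:]


# Hypothetical toy vocabulary, assumed for illustration only.
subject, verb = agree("N", "V", "pl")          # subject and verb share "pl"
clause = merge(subject, merge(verb, Leaf("WH")))

base_order = flatten(clause)
print(base_order)            # ['N.pl', 'V.pl', 'WH']
print(move(base_order, 2))   # ['WH', 'N.pl', 'V.pl', '_']
```

In this toy, the shared `.pl` suffix and the `_` gap marker act as explicit surface cues for which tokens depend on each other.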
Just 500 steps of PPT with MP-STRUCT match existing strong baselines in token efficiency while also making LLMs resistant to structurally implausible languages such as REVERSE. The researchers find that MP-STRUCT CORE outperforms k-Shuffle Dyck despite not being definable in C-RASP, a formal bound on transformer expressivity. This challenges the prior assumption that effective PPT languages must be both hierarchically expressive and circuit-theoretically learnable. The key factor is functional landmarks: structural cues that reduce ambiguity in resolving dependencies. The work, accepted to ACL 2026, suggests that effective PPT design hinges not just on expressivity but on the accessibility of dependency resolution, bringing LLMs one step closer to human-like data efficiency.
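For readers unfamiliar with the comparison language, the sketch below checks membership in k-Shuffle Dyck under its usual definition: k bracket types interleaved freely, with each type balanced when viewed on its own. It also applies a full token reversal. Treating REVERSE as whole-sequence reversal is an assumption made for this illustration, since the summary does not define it, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def in_shuffle_dyck(tokens: Sequence[str], pairs: List[Tuple[str, str]]) -> bool:
    """Return True if every bracket type, taken on its own, is balanced.

    Brackets of different types may cross, so one depth counter per type
    is enough; no stack is needed.
    """
    openers = {o: i for i, (o, _) in enumerate(pairs)}
    closers = {c: i for i, (_, c) in enumerate(pairs)}
    depth = [0] * len(pairs)
    for tok in tokens:
        if tok in openers:
            depth[openers[tok]] += 1
        elif tok in closers:
            i = closers[tok]
            depth[i] -= 1
            if depth[i] < 0:       # a closer appeared before its opener
                return False
        else:
            return False           # token outside the bracket alphabet
    return all(d == 0 for d in depth)


def reverse_tokens(tokens: Sequence[str]) -> List[str]:
    """Assumed REVERSE-style perturbation: reverse the whole token sequence."""
    return list(tokens)[::-1]


PAIRS = [("(", ")"), ("[", "]")]   # k = 2 bracket types

print(in_shuffle_dyck(list("([)]"), PAIRS))                   # True: crossing is allowed
print(in_shuffle_dyck(list("(]"), PAIRS))                     # False: "]" has no opener
print(in_shuffle_dyck(reverse_tokens(list("([)]")), PAIRS))   # False
```

Note that nothing on the surface of a k-Shuffle Dyck string says which opener a given closer resolves; only the running counts do. That is the kind of ambiguity the functional landmarks described above are said to reduce.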
- MP-STRUCT encodes MERGE, AGREE, and MOVE operations to mimic innate human language constraints.
- 500-step pre-pretraining matches strong baselines in token efficiency while resisting structurally implausible languages like REVERSE.
- MP-STRUCT CORE outperforms k-Shuffle Dyck despite not being definable in C-RASP, a formal bound on transformer expressivity.
Why It Matters
This research could substantially reduce the amount of training data LLMs need, making them more data-efficient and more resistant to unnatural language patterns, closer to how humans acquire language.