Basic syntax from speech: Spontaneous concatenation in unsupervised deep neural networks
Unsupervised neural networks spontaneously combine words after hearing only single words.
A team led by Gašper Beguš at UC Berkeley has uncovered a striking phenomenon in unsupervised deep neural networks: models trained solely on acoustic recordings of single words spontaneously begin generating outputs in which two or even three words are concatenated. The study, published on arXiv (2305.01626), uses ciwGAN/fiwGAN architectures, convolutional networks trained in a generative adversarial setting on raw speech waveforms. Despite never being exposed to multi-word sequences, the models produce concatenated speech that mimics basic syntactic structure, a property the researchers term 'spontaneous concatenation.' The behavior replicated across multiple independently trained models with varying hyperparameters and training datasets, suggesting a robust emergent property rather than a fluke.
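To make the setup concrete, here is a minimal NumPy sketch of the ciwGAN-style pairing of a generator and a code-recovery ("Q") network. All dimensions, weights, and function names below are illustrative assumptions, not the authors' implementation; the real models use deep 1-D convolutions over raw 16 kHz audio and are trained adversarially.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumed for illustration; the real ciwGAN generates
# roughly one second of 16 kHz waveform).
N_WORDS, NOISE_DIM, WAVE_LEN = 4, 8, 64

# Generator: latent one-hot "word code" + noise -> raw waveform.
W_gen = rng.normal(size=(N_WORDS + NOISE_DIM, WAVE_LEN))

def generate(code_idx):
    code = np.zeros(N_WORDS)
    code[code_idx] = 1.0
    z = np.concatenate([code, rng.normal(size=NOISE_DIM)])
    return np.tanh(z @ W_gen)          # fake "audio" in [-1, 1]

# Q-network: waveform -> predicted word code. During training, the
# requirement that the code be recoverable from the audio is what
# pushes the generator to encode word identity (an InfoGAN-style
# objective); these weights are untrained stand-ins.
W_q = rng.normal(size=(WAVE_LEN, N_WORDS))

def q_predict(wave):
    return int(np.argmax(wave @ W_q))

wave = generate(2)
print(wave.shape)  # (64,)
```

The point of the sketch is only the data flow: the generator never sees multi-word audio, yet the paper reports that its outputs can contain multiple concatenated words.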
The implications are significant for both AI and neuroscience. Networks trained on two words could also embed those words into novel, unobserved combinations, demonstrating a precursor to compositionality, a cornerstone of human language. To explain these results, the researchers formalize a neural mechanism called 'disinhibition,' which outlines a possible pathway by which concatenation and compositionality could arise in both artificial and biological neural networks. The work challenges the predominantly text-based approach to computational syntax, proposing that the most basic step in the evolution of syntax can be modeled directly from raw acoustic input. It also opens new avenues for understanding how the brain might process spoken language without explicit symbolic representations and offers testable predictions for future neural-processing experiments.
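The intuition behind disinhibition can be shown with a toy sketch (my construction, not the authors' formalization): an inhibitory unit normally suppresses any word unit beyond the first, so only one word is produced; weakening that inhibition "releases" the second word, and the output becomes a concatenation. The thresholds and drive values below are arbitrary assumptions chosen to make the two regimes visible.

```python
def produce(inhibition_strength, threshold=0.5):
    """Emit whichever word units exceed threshold after inhibition."""
    word_drive = [1.0, 0.9]          # both word units receive input
    output = []
    for i, drive in enumerate(word_drive):
        # The inhibitory unit targets every word after the first.
        net = drive - (inhibition_strength if i > 0 else 0.0)
        if net > threshold:
            output.append(f"word{i + 1}")
    return " ".join(output)

print(produce(inhibition_strength=0.8))  # word1  (second word suppressed)
print(produce(inhibition_strength=0.1))  # word1 word2  (disinhibited)
```

Under this reading, no new circuitry is needed for concatenation; a single continuous parameter (inhibition strength) switches the system from one-word to multi-word output, which is why the mechanism is plausible in both artificial and biological networks.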
- ciwGAN/fiwGAN models trained on single-word audio spontaneously generate 2-3 word concatenations without multi-word training data.
- The phenomenon replicates across multiple models with different hyperparameters and datasets, indicating robust emergence.
- Models also embed words into novel, unobserved combinations, showing precursors to compositionality, a key feature of syntax.
Why It Matters
This suggests that basic syntax can emerge from raw speech alone, with implications for both AI language models and the neuroscience of language.