ZeroSyl: Simple Zero-Resource Syllable Tokenization for Spoken Language Modeling
New research bypasses complex pipelines to find syllables directly in raw audio using frozen WavLM features.
Researchers Nicol Visser, Simon Malan, Danel Slabbert, and Herman Kamper developed ZeroSyl, a zero-resource syllable tokenizer for spoken language models. It uses the L2 norms of frame-level features from an intermediate layer of a frozen WavLM model to identify syllable boundaries, without any training for boundary detection. The method outperforms prior, more complex systems like Sylber and SyllableLM on lexical, syntactic, and narrative benchmarks, offering a simpler path to building pure speech language models from raw audio.
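The core idea (frame-level feature norms dip at syllable boundaries) can be sketched in a few lines. The code below is a hypothetical illustration, not the authors' implementation: it substitutes synthetic random features for real WavLM activations, and the smoothing window, peak-finding parameters, and function name `syllable_boundaries` are all assumptions made for the example.

```python
import numpy as np
from scipy.signal import find_peaks

def syllable_boundaries(features, min_gap=5, smooth=5, prominence=1.0):
    """Sketch of norm-valley boundary detection.

    `features` is a (T, D) array of per-frame vectors (in the real system,
    intermediate-layer WavLM features; here, synthetic stand-ins).
    Returns frame indices where the smoothed L2 norm has a pronounced dip,
    taken as candidate syllable boundaries.
    """
    norms = np.linalg.norm(features, axis=1)          # per-frame L2 norm
    kernel = np.ones(smooth) / smooth
    smoothed = np.convolve(norms, kernel, mode="same")  # light smoothing
    # Syllable nuclei are high-energy, so boundaries sit in norm valleys:
    # find peaks of the negated curve, filtered by spacing and prominence.
    valleys, _ = find_peaks(-smoothed, distance=min_gap,
                            prominence=prominence)
    return valleys, norms

# Toy input: three "syllables" as bumps in feature magnitude,
# separated by low-norm valleys around frames ~20 and ~40.
rng = np.random.default_rng(0)
envelope = np.concatenate([np.hanning(20)] * 3) + 0.1   # 60 frames
features = envelope[:, None] * rng.standard_normal((60, 8))
bounds, norms = syllable_boundaries(features)
print(bounds)  # frame indices of detected low-norm valleys
```

With a frozen pretrained model, the only moving parts are these few signal-processing hyperparameters, which is what makes the zero-resource claim plausible: no boundary labels or tokenizer training are needed.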
Why It Matters
It simplifies building AI that understands language directly from speech, which is crucial for low-resource languages that lack written text.