A Family of LLMs Liberated from Static Vocabularies
New architecture processes raw bytes instead of tokens, improving compression by 30% and handling spelling variations.
Aleph Alpha, the European AI research lab, has published a groundbreaking paper introducing a family of large language models that fundamentally change how AI processes text. Their Hierarchical Autoregressive Transformer (HAT) architecture eliminates the need for traditional tokenization by processing raw bytes directly. The system uses an encoder to aggregate bytes into word embeddings, feeds them to a transformer backbone (adapted from existing models like Llama 3.1), then uses a decoder to convert outputs back to bytes. This approach liberates models from the constraints of static vocabularies that have limited current LLMs.
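The paper's data flow can be sketched in a few lines of plain Python. This is a toy illustration only: the real HAT modules are learned transformer encoders and decoders, while here a word's bytes are simply averaged into a fixed-size vector to show the bytes → word embeddings → backbone → bytes hierarchy. The whitespace splitter and 4-dimensional embedding are assumptions for the sketch, not details from the paper.

```python
# Toy sketch of the HAT data flow. Assumption: real HAT uses learned
# transformer encoder/decoder modules; we mean-pool raw byte values
# only to illustrate the bytes -> words -> backbone -> bytes hierarchy.

def split_into_words(text: str) -> list[bytes]:
    """Whitespace-delimited byte chunks; HAT's actual word splitting may differ."""
    return [w.encode("utf-8") for w in text.split()]

def encode_word(word: bytes, dim: int = 4) -> list[float]:
    """Stand-in 'encoder': aggregate a word's bytes into one fixed-size vector."""
    vec = [0.0] * dim
    for i, b in enumerate(word):
        vec[i % dim] += b / 255.0
    n = max(len(word), 1)
    return [v / n for v in vec]

def hat_forward(text: str) -> list[list[float]]:
    """One embedding per word: the backbone sees len(words) positions,
    not len(bytes) positions, which is where the compression gain comes from."""
    return [encode_word(w) for w in split_into_words(text)]

embs = hat_forward("byte level models need no tokenizer")
print(len(embs))  # 6 backbone positions for 6 words
```

In the actual architecture, a byte-level decoder then expands each backbone output back into a byte sequence, so generation still happens at byte granularity.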
The team demonstrated their approach by converting Meta's Llama 3.1 8B and 70B models into the HAT architecture, creating Llama-3.1-8B-TFree-HAT and Llama-3.1-70B-TFree-HAT. They also trained a completely new 7B parameter model, Llama-TFree-HAT-Pretrained, from scratch on nearly 4 trillion words of training data. The architecture shows significant improvements in text compression—reducing the number of required sequence positions—and handles intra-word variations like spelling differences more robustly than token-based approaches.
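The sequence-position savings are easy to quantify for any input. A minimal sketch, assuming one backbone position per whitespace-delimited word versus one position per byte for a naive byte-level model (actual tokenizer compression ratios vary by language and vocabulary):

```python
# Rough illustration of the sequence-position savings. Assumption: one
# backbone position per whitespace word, vs. one per byte for a pure
# byte-level model; real ratios depend on language and word length.

def positions(text: str) -> tuple[int, int]:
    """Return (byte-level positions, word-level positions) for a string."""
    return len(text.encode("utf-8")), len(text.split())

n_bytes, n_words = positions("Die Forscher trainierten ein neues Modell")
print(n_bytes, n_words)  # 41 bytes vs. 6 word positions
```

Because attention cost grows with sequence length, collapsing tens of bytes into a handful of word positions is what makes byte-level processing competitive with subword tokenization.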
After supervised fine-tuning and direct preference optimization in both English and German, the HAT models outperform the original Llama 3.1 models on most benchmarks while maintaining strong multilingual proficiency. The researchers have released their models, including 200 pre-training checkpoints, on Hugging Face, making this byte-level approach accessible to the broader AI community. This represents a fundamental shift in how language models process text, potentially enabling more efficient and adaptable AI systems.
- Eliminates traditional tokenization by processing raw bytes directly through HAT architecture
- Converted Llama 3.1 8B and 70B into HAT models and trained a new 7B model from scratch on nearly 4 trillion words
- Improves text compression and handles spelling variations better than token-based models
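The robustness claim in the last point can be made concrete. A subword tokenizer typically maps spelling variants to entirely different token IDs, while byte-level input keeps them close. This sketch measures closeness as shared byte-prefix length, which is an assumption chosen for illustration, not the model's actual similarity measure:

```python
# Sketch of why byte-level input degrades gracefully on spelling variants.
# Assumption: we use shared-prefix length over UTF-8 bytes as a stand-in
# similarity; a subword tokenizer would instead emit unrelated token IDs.

def shared_prefix(a: bytes, b: bytes) -> int:
    """Count how many leading bytes two byte strings have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

overlap = shared_prefix("colour".encode("utf-8"), "color".encode("utf-8"))
print(overlap)  # 4 -- "colo" is shared, so the inputs stay close
```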
Why It Matters
Eliminates vocabulary limits, improves multilingual performance, and makes AI more adaptable to new domains and languages.