propella-1: Multi-Property Document Annotation for LLM Data Curation at Scale
These small models could fix a weak link in how we train AI: data curation.
Researchers have released propella-1, a family of small multilingual LLMs (0.6B, 1.7B, and 4B parameters) that annotate text across 18 distinct properties, such as quality and reasoning depth, replacing simplistic single-score classifiers. The 4B model achieves higher agreement with annotation benchmarks than much larger general-purpose models. The researchers also released a dataset of over three billion document annotations covering major pretraining corpora, enabling a new, multi-dimensional analysis of training data quality and composition.
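To see why multi-property annotation matters for curation, consider a minimal sketch (illustrative only; the field names and scores below are hypothetical, not the propella-1 schema): each document carries a dict of per-property scores, and a filter can combine several properties at once, whereas a single quality score cannot separate, say, well-written fluff from reasoning-rich text.

```python
# Hypothetical multi-property annotations (not the actual propella-1 output format).
docs = [
    {"text": "Proof of the triangle inequality ...",
     "annotations": {"quality": 0.9, "reasoning_depth": 0.8, "toxicity": 0.0}},
    {"text": "CLICK HERE to win a free prize ...",
     "annotations": {"quality": 0.2, "reasoning_depth": 0.1, "toxicity": 0.1}},
    {"text": "Celebrity gossip roundup ...",
     "annotations": {"quality": 0.7, "reasoning_depth": 0.1, "toxicity": 0.0}},
]

def select(docs, **thresholds):
    """Keep documents whose annotations meet every per-property threshold."""
    return [d for d in docs
            if all(d["annotations"].get(prop, 0.0) >= t
                   for prop, t in thresholds.items())]

# A single quality cutoff keeps the well-written gossip document;
# adding a reasoning_depth threshold filters it out.
high_quality = select(docs, quality=0.6)
reasoning_rich = select(docs, quality=0.6, reasoning_depth=0.5)
print(len(high_quality), len(reasoning_rich))  # 2 1
```

The point of the sketch: with 18 properties, a curator can compose filters tailored to a target mixture (e.g. high quality *and* deep reasoning *and* low toxicity) rather than ranking everything on one axis.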
Why It Matters
This provides a powerful, open-source tool to build higher-quality, more transparent, and safer LLMs by deeply understanding their training data.