propella-1: Multi-Property Document Annotation for LLM Data Curation at Scale
These small models could fix a weak link in how we train AI: data curation.
Researchers have released propella-1, a family of small multilingual LLMs (0.6B, 1.7B, and 4B parameters) that annotate text across 18 distinct properties, such as quality and reasoning depth, replacing simplistic single-score classifiers. The 4B model achieves higher agreement with annotation benchmarks than much larger general-purpose models. The researchers also released a dataset of over three billion document annotations covering major pretraining corpora, enabling a new, multi-dimensional analysis of training data quality and composition.
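To see why multi-property annotation matters for curation, consider a minimal sketch (illustrative only; the field names and scores below are hypothetical, not the propella-1 schema): each document carries a dict of per-property scores, and a filter can combine several properties at once, whereas a single quality score cannot separate, say, well-written fluff from reasoning-rich text.

```python
# Hypothetical multi-property annotations (not the actual propella-1 output format).
docs = [
    {"text": "Proof of the triangle inequality ...",
     "annotations": {"quality": 0.9, "reasoning_depth": 0.8, "toxicity": 0.0}},
    {"text": "CLICK HERE to win a free prize ...",
     "annotations": {"quality": 0.2, "reasoning_depth": 0.1, "toxicity": 0.1}},
    {"text": "Celebrity gossip roundup ...",
     "annotations": {"quality": 0.7, "reasoning_depth": 0.1, "toxicity": 0.0}},
]

def select(docs, **thresholds):
    """Keep documents whose annotations meet every per-property threshold."""
    return [d for d in docs
            if all(d["annotations"].get(prop, 0.0) >= t
                   for prop, t in thresholds.items())]

# A single quality cutoff keeps the well-written gossip document;
# adding a reasoning_depth threshold filters it out.
high_quality = select(docs, quality=0.6)
reasoning_rich = select(docs, quality=0.6, reasoning_depth=0.5)
print(len(high_quality), len(reasoning_rich))  # 2 1
```

The point of the sketch: with 18 properties, a curator can compose filters tailored to a target mixture (e.g. high quality *and* deep reasoning *and* low toxicity) rather than ranking everything on one axis.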
Why It Matters
This provides a powerful, open-source tool to build higher-quality, more transparent, and safer LLMs by deeply understanding their training data.