NorBERTo: A ModernBERT Model Trained for Portuguese on a 331-Billion-Token Corpus
New Portuguese language model beats baselines with 0.9191 F1 on MRPC
A team of researchers led by Enzo S. N. Silva has released NorBERTo, a new encoder-only language model for Portuguese built on the ModernBERT architecture. The model is trained on Aurora-PT, a newly curated corpus of 331 billion GPT-2 tokens sourced from diverse web content and existing multilingual datasets — making it the largest openly available monolingual Portuguese corpus to date. NorBERTo leverages ModernBERT's long-context support and efficient attention mechanisms, enabling it to handle longer documents while remaining computationally practical for real-world deployment.
In systematic benchmarks against strong baselines like BERTimbau and Albertina PT-BR, NorBERTo-large achieved top scores on the PLUE benchmark suite, notably 0.9191 F1 on MRPC (paraphrase detection) and 0.7689 accuracy on RTE (recognizing textual entailment). On ASSIN 2, it attained the highest entailment F1 (~0.904) among all encoder models evaluated, though larger models still lead on some tasks. The model is designed for straightforward fine-tuning and efficient serving, making it a strong candidate for downstream applications such as semantic search, classification, and retrieval-augmented generation in Portuguese.
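To make the "straightforward fine-tuning" claim concrete, here is a minimal sketch of adapting an encoder like NorBERTo to a sentence-pair task with the Hugging Face transformers library. The repo id "norberto-large" is a placeholder for illustration, not the confirmed checkpoint name, and the Portuguese sentence pair is an invented example.

```python
# Minimal fine-tuning setup sketch, assuming NorBERTo ships as a standard
# Hugging Face checkpoint. "norberto-large" is a hypothetical repo id.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "norberto-large"  # placeholder, not the confirmed model id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,  # e.g. paraphrase / not-paraphrase, as in MRPC-style tasks
)

# Encode a Portuguese sentence pair, as in paraphrase detection.
inputs = tokenizer(
    "O modelo foi treinado em português.",
    "O modelo recebeu treinamento em língua portuguesa.",
    return_tensors="pt",
    truncation=True,
)

# Forward pass; before fine-tuning, the classification head is untrained,
# so these probabilities are only meaningful after training on labeled pairs.
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))
```

From here, standard training (for example with the transformers Trainer) on a labeled Portuguese dataset would tune the classification head and encoder weights.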
- NorBERTo is based on the ModernBERT architecture, with long-context support and efficient attention mechanisms.
- Trained on Aurora-PT, the largest open monolingual Portuguese corpus (331B GPT-2 tokens).
- Outperforms baselines on PLUE: 0.9191 F1 on MRPC and 0.7689 accuracy on RTE.
Why It Matters
Advances Portuguese NLP with a modern, efficient encoder for fine-tuning and retrieval-augmented generation.