SupraLabs' Supra-50M outperforms GPT-2 despite being 2.5x smaller
A 50M-parameter model beats GPT-2 (124M) on science and linguistics benchmarks.
SupraLabs has released Supra-50M, a compact 50M-parameter causal language model (Base and Instruct variants) built from scratch on a Llama-style decoder-only transformer architecture. The model was trained on 20 billion tokens of high-quality educational web text from the HuggingFaceFW/fineweb-edu dataset, on a single GPU with bfloat16 precision and torch.compile. Key architectural choices include tied embeddings, grouped query attention (4 KV heads), a custom Byte-Level BPE tokenizer with a vocabulary of 32,000, and 1,024 maximum position embeddings. Its small size makes Supra-50M practical to run efficiently on consumer hardware.
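To make the architecture description concrete, here is a minimal sketch of how the parameter count of a Llama-style decoder can be estimated. Only the vocabulary size (32,000), the 4 KV heads, the tied embeddings, and the 1,024 maximum positions come from the release; the hidden size, layer count, number of attention heads, and MLP width are hypothetical values chosen so the total lands near 50M, and they are not SupraLabs' published configuration. The sketch also assumes a SwiGLU MLP, RMSNorm, and rotary position embeddings, as in standard Llama models.

```python
from dataclasses import dataclass


@dataclass
class DecoderConfig:
    # Values stated in the release
    vocab_size: int = 32_000
    max_position_embeddings: int = 1_024  # rotary positions add no learned weights
    num_kv_heads: int = 4
    tie_embeddings: bool = True
    # Hypothetical values (NOT from the release), picked to land near 50M
    hidden_size: int = 512
    num_layers: int = 11
    num_heads: int = 8
    intermediate_size: int = 1_536


def count_parameters(cfg: DecoderConfig) -> int:
    """Estimate learned parameters of a Llama-style decoder (no biases)."""
    head_dim = cfg.hidden_size // cfg.num_heads
    kv_dim = cfg.num_kv_heads * head_dim  # grouped query attention shrinks K and V

    embed = cfg.vocab_size * cfg.hidden_size
    # With tied embeddings the output head reuses the input embedding matrix.
    lm_head = 0 if cfg.tie_embeddings else embed

    # Q and O projections are hidden x hidden; K and V are hidden x kv_dim.
    attn = 2 * cfg.hidden_size * cfg.hidden_size + 2 * cfg.hidden_size * kv_dim
    # SwiGLU MLP: gate, up, and down projections.
    mlp = 3 * cfg.hidden_size * cfg.intermediate_size
    # Two RMSNorm weight vectors per layer.
    norms = 2 * cfg.hidden_size

    per_layer = attn + mlp + norms
    final_norm = cfg.hidden_size
    return embed + lm_head + cfg.num_layers * per_layer + final_norm


if __name__ == "__main__":
    tied = count_parameters(DecoderConfig())
    untied = count_parameters(DecoderConfig(tie_embeddings=False))
    print(f"tied embeddings:   {tied / 1e6:.1f}M parameters")
    print(f"untied embeddings: {untied / 1e6:.1f}M parameters")
```

Under these assumed dimensions the embedding matrix accounts for roughly a third of the total, which is why tying the input and output embeddings is a common choice at this scale.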
In benchmark comparisons, Supra-50M (50M parameters) outperforms GPT-2 (124M parameters) on BLiMP (76.3% vs 63%), SciQ (77.2% vs 53.2%), and ARC-Easy (52.2% vs 42%). It also beats or ties larger models like SmolLM-135M and OpenELM-270M on several tasks, suggesting that data quality and architecture can matter more than raw parameter count. The model is the first release in SupraLabs' Scaling Up Plan, with Supra-124M (Base, Chat, Experimental Reasoning) and Supra-350M (Base, Chat, Reasoning, Coding) models coming next.
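The score gaps can be read two ways: as an absolute difference in percentage points, or as a gain relative to GPT-2's score. The short sketch below computes both from the figures quoted above, along with the parameter ratio; the function and variable names are illustrative only.

```python
# Benchmark accuracy in percent, as quoted in the release: (Supra-50M, GPT-2 124M)
SCORES = {
    "BLiMP": (76.3, 63.0),
    "SciQ": (77.2, 53.2),
    "ARC-Easy": (52.2, 42.0),
}


def gaps(supra: float, gpt2: float) -> tuple[float, float]:
    """Return (percentage-point gap, relative gain in percent over GPT-2)."""
    point_gap = supra - gpt2
    relative_gain = point_gap / gpt2 * 100
    return round(point_gap, 1), round(relative_gain, 1)


def size_ratio(large_params_m: float, small_params_m: float) -> float:
    """How many times larger the bigger model is, rounded to two decimals."""
    return round(large_params_m / small_params_m, 2)


if __name__ == "__main__":
    for task, (supra, gpt2) in SCORES.items():
        points, relative = gaps(supra, gpt2)
        print(f"{task}: +{points} points ({relative}% relative to GPT-2)")
    print(f"GPT-2 is {size_ratio(124, 50)}x the size of Supra-50M")
```

The distinction matters when quoting the results: a 13.3-point lead on BLiMP corresponds to about a 21% relative improvement over GPT-2's score, so the two figures are not interchangeable.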
- Outperforms GPT-2 (124M) by 13.3 percentage points on BLiMP and 24 percentage points on SciQ despite being 2.5x smaller.
- Llama-style architecture with 50M parameters, trained on 20B tokens from fineweb-edu using a single GPU.
- First model in SupraLabs' Scaling Up Plan; upcoming releases include Supra-124M and Supra-350M.
Why It Matters
Shows that compact, well-trained models can rival larger ones, enabling efficient deployment on consumer hardware.