Open Source

SupraLabs' Supra-50M outperforms GPT-2 despite being 2.5x smaller

r/LocalLLaMA May 22, 2026

⚡A 50M-parameter model beats GPT-2 (124M) on science and linguistics benchmarks.

Deep Dive

SupraLabs has released Supra-50M, a compact 50M-parameter causal language model (Base and Instruct variants) built from scratch on a Llama-style decoder-only transformer architecture. The model was trained on 20 billion tokens of educational web text from the HuggingFaceFW/fineweb-edu dataset, on a single GPU with bfloat16 precision and torch.compile. Key architectural choices include tied embeddings, grouped-query attention (4 KV heads), a custom byte-level BPE tokenizer with a 32,000-token vocabulary, and a maximum of 1,024 position embeddings. Its small size is meant to keep inference efficient on consumer hardware.
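
To see how these choices can add up to roughly 50M parameters, here is a minimal sketch that counts the weights of a Llama-style decoder. Only the vocabulary size, context length, KV-head count, and tied embeddings come from the release. The hidden size, MLP width, layer count, and attention-head count are not stated in the source and are illustrative guesses, as is the `LlamaStyleConfig` name itself.

```python
from dataclasses import dataclass


@dataclass
class LlamaStyleConfig:
    # Values reported in the article.
    vocab_size: int = 32_000
    max_position_embeddings: int = 1_024  # rotary positions add no weights
    num_key_value_heads: int = 4
    tie_word_embeddings: bool = True
    # ASSUMPTIONS: not given in the source, chosen only to land near 50M.
    hidden_size: int = 512
    intermediate_size: int = 1_408
    num_hidden_layers: int = 12
    num_attention_heads: int = 8


def count_parameters(cfg: LlamaStyleConfig) -> int:
    """Count weights of a bias-free Llama-style decoder with a SwiGLU MLP."""
    head_dim = cfg.hidden_size // cfg.num_attention_heads
    kv_dim = head_dim * cfg.num_key_value_heads  # grouped-query attention

    embedding = cfg.vocab_size * cfg.hidden_size
    # Q and O projections are hidden x hidden; K and V shrink to hidden x kv_dim.
    attention = 2 * cfg.hidden_size * cfg.hidden_size + 2 * cfg.hidden_size * kv_dim
    # SwiGLU uses three matrices: gate, up, and down.
    mlp = 3 * cfg.hidden_size * cfg.intermediate_size
    norms = 2 * cfg.hidden_size  # two RMSNorm weight vectors per layer

    total = embedding + cfg.num_hidden_layers * (attention + mlp + norms)
    total += cfg.hidden_size  # final RMSNorm
    if not cfg.tie_word_embeddings:
        total += embedding  # separate output projection
    return total


if __name__ == "__main__":
    tied = count_parameters(LlamaStyleConfig())
    untied = count_parameters(LlamaStyleConfig(tie_word_embeddings=False))
    print(f"tied embeddings:   {tied / 1e6:.1f}M parameters")
    print(f"untied embeddings: {untied / 1e6:.1f}M parameters")
```

Under these assumed dimensions, tying the input and output embeddings saves one full vocabulary matrix (32,000 x 512 weights), which is a large share of the budget for a model this small.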

In benchmark comparisons, Supra-50M outperforms GPT-2 (124M parameters) on BLiMP (76.3% vs 63%), SciQ (77.2% vs 53.2%), and ARC-Easy (52.2% vs 42%), leads of 13.3, 24.0, and 10.2 percentage points respectively. It also beats or ties larger models such as SmolLM-135M and OpenELM-270M on several tasks, which suggests that data quality and architecture can matter more than raw parameter count at this scale. The model is the first release in SupraLabs' Scaling Up Plan, with Supra-124M (Base, Chat, Experimental Reasoning) and Supra-350M (Base, Chat, Reasoning, Coding) models coming next.
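
The headline gaps are differences in accuracy, so they are best read as percentage points rather than relative improvements. The short sketch below reproduces the arithmetic from the scores quoted above; the helper names are illustrative and not part of any SupraLabs tooling.

```python
# Accuracy in percent, as quoted in the article: (Supra-50M, GPT-2 124M).
SCORES = {
    "BLiMP": (76.3, 63.0),
    "SciQ": (77.2, 53.2),
    "ARC-Easy": (52.2, 42.0),
}


def gap_points(supra: float, gpt2: float) -> float:
    """Absolute difference in accuracy, in percentage points."""
    return round(supra - gpt2, 1)


def relative_gain(supra: float, gpt2: float) -> float:
    """Improvement relative to the GPT-2 score, in percent."""
    return round(100 * (supra - gpt2) / gpt2, 1)


def size_ratio(large_params_m: float, small_params_m: float) -> float:
    """How many times larger the bigger model is."""
    return round(large_params_m / small_params_m, 2)


if __name__ == "__main__":
    for task, (supra, gpt2) in SCORES.items():
        print(f"{task}: +{gap_points(supra, gpt2)} points, "
              f"+{relative_gain(supra, gpt2)}% relative")
    print(f"GPT-2 is {size_ratio(124, 50)}x the size of Supra-50M")
```

The size ratio comes out at 2.48, which the article rounds to 2.5x.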

Key Points

Outperforms GPT-2 (124M) by 13.3 percentage points on BLiMP and 24 percentage points on SciQ despite being about 2.5x smaller.
Llama-style architecture with 50M parameters, trained on 20B tokens from fineweb-edu using a single GPU.
First model in SupraLabs' Scaling Up Plan; upcoming releases include Supra-124M and Supra-350M.

Why It Matters

Suggests that compact, well-trained models can rival larger ones on some benchmarks, enabling efficient deployment on consumer hardware.
