[P] Dante-2B: I'm training a fully open 2.1B bilingual Italian/English LLM from scratch on 2×H200. Phase 1 done; here's what I've built.
Trained in 16 days on 2×H200 GPUs, the model targets a core inefficiency: standard tokenizers waste 20-30% of the context window on Italian text.
Independent developer [P] has completed Phase 1 of training Dante-2B, a 2.1B-parameter, decoder-only transformer built from the ground up for native Italian and English fluency. Unlike the common approach of fine-tuning an English-first model, Dante-2B started from random initialization and was trained on a 300B-token corpus over 16 days on 2×H200 GPUs. The architecture is LLaMA-style, with Grouped-Query Attention (20 query heads, 4 KV heads), a SwiGLU FFN, and RoPE, optimized for Flash Attention. The project's core innovation is a custom 64K BPE tokenizer trained on a character-balanced mix of ~42% Italian, ~36% English, and ~22% code. This tokenizer treats Italian apostrophe contractions and accented characters as atomic units, addressing a critical flaw in standard tokenizers, whose inefficient segmentation of Italian text wastes 20-30% of a model's context window.
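The 20-30% figure is at heart a claim about tokenizer fertility: how many tokens a tokenizer emits per unit of text. A minimal sketch of how such a comparison can be measured, using GPT-2's English-centric BPE as an illustrative baseline (the baseline choice and the sample sentence are assumptions for illustration, not from the post):

```python
from transformers import AutoTokenizer

# Sample Italian sentence with apostrophe contractions and accented
# characters (illustrative; not taken from the Dante-2B corpus).
italian = "Un'altra città è all'orizzonte, perché l'amico dell'autore è già lì."

# English-centric BPE baseline; Dante-2B's tokenizer is not yet released.
tok = AutoTokenizer.from_pretrained("gpt2")
ids = tok.encode(italian)

# Fertility = tokens per character. Higher fertility means more of a fixed
# context window is consumed encoding the same text.
print(f"tokens: {len(ids)}, chars: {len(italian)}, "
      f"fertility: {len(ids) / len(italian):.3f}")
```

Running the same measurement with an Italian-aware tokenizer on a matched corpus sample is how a 20-30% context saving would be quantified.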
After training on 100B tokens at a sequence length of 2048, the model already generates coherent Italian with correct grammar and article use. Training used DeepSpeed ZeRO-2, torch.compile, and FP8 precision, achieving 28% Model FLOPs Utilization (MFU) with no stability issues. Phase 2, now in progress, will extend the context window to 4096 tokens over the next 4-7 days. Once complete, the plan is a full HuggingFace release of the base model, followed by a Supervised Fine-Tuning (SFT) phase for instruction following. Dante-2B is a significant step toward efficient, specialized open-source models for non-English languages, moving beyond the "afterthought" treatment these languages receive in most multilingual LLMs.
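MFU compares achieved training FLOP/s against the hardware's aggregate peak, typically via the standard ~6·N FLOPs-per-token estimate for a dense decoder-only transformer. A back-of-the-envelope sketch; the throughput and the dense-FP8 peak figure for an H200 are assumptions, not numbers from the post:

```python
def mfu(n_params: float, tokens_per_sec: float, n_gpus: int,
        peak_flops_per_gpu: float) -> float:
    """Model FLOPs Utilization: achieved FLOP/s over aggregate hardware peak.

    Uses the standard ~6 * N FLOPs-per-token estimate for the combined
    forward + backward pass of a dense decoder-only transformer.
    """
    achieved = 6 * n_params * tokens_per_sec
    return achieved / (n_gpus * peak_flops_per_gpu)

# Assumed numbers: ~1979 dense-FP8 TFLOPS per H200 and ~88k tokens/s of
# aggregate throughput, which would roughly reproduce the reported 28%.
print(f"MFU: {mfu(2.1e9, 88_000, 2, 1979e12):.1%}")
```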
- Custom 64K BPE tokenizer treats Italian contractions and accents as single tokens, fixing a 20-30% context-window inefficiency.
- Trained from scratch on 2×H200 GPUs for 16 days using a 300B-token corpus of Italian web text, literature, legal documents, and code.
- Phase 1 complete: the 2.1B-parameter model generates coherent Italian; Phase 2 will extend the context to 4096 tokens ahead of the open release (see the RoPE sketch after this list).
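Because RoPE encodes relative positions multiplicatively rather than through learned absolute embeddings, the Phase 2 jump from 2048 to 4096 tokens requires no embedding-matrix surgery, only longer cos/sin caches plus continued training on longer sequences. A minimal sketch; the head_dim of 128 is an assumed value, since the post does not state the hidden size:

```python
import torch

def rope_cache(seq_len: int, head_dim: int, base: float = 10000.0):
    """Precompute RoPE cos/sin tables for a given sequence length.

    RoPE rotates query/key pairs by position-dependent angles, so extending
    the context only means computing the tables out to more positions.
    """
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    pos = torch.arange(seq_len).float()
    freqs = torch.outer(pos, inv_freq)  # (seq_len, head_dim // 2)
    return freqs.cos(), freqs.sin()

cos_2k, sin_2k = rope_cache(2048, head_dim=128)  # Phase 1 sequence length
cos_4k, sin_4k = rope_cache(4096, head_dim=128)  # Phase 2 sequence length
```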
Why It Matters
Dante-2B offers a high-efficiency, fully open foundation for Italian NLP applications, moving beyond inefficient fine-tunes of English-first models.