Research & Papers

Building my own Diffusion Language Model from scratch was easier than I thought [P]

Training a diffusion LM from scratch in hours on a MacBook Air M2...

Deep Dive

In a refreshing break from AI-assisted development, a developer (Reddit user Encrux615) decided to build a discrete diffusion language model completely from scratch—no Claude Code, no Copilot, just raw coding while waiting for their master's thesis training to complete. The result is a 7.5 million parameter diffusion language model trained on a MacBook Air M2 using Karpathy's tiny Shakespeare dataset. The model uses a vocabulary of just 66 tokens: 65 characters plus a special [MASK] token for the diffusion process.
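The vocabulary setup described above (65 characters plus a [MASK] token) could look something like this minimal sketch. The variable names and the sample text are illustrative, not taken from the actual repository:

```python
# Hypothetical character-level tokenizer with a reserved [MASK] id,
# mirroring the 65-chars-plus-[MASK] vocabulary described in the post.
text = "To be, or not to be, that is the question."
chars = sorted(set(text))              # unique characters in the corpus
MASK = len(chars)                      # reserve the last id for [MASK]
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

def encode(s):
    """Map a string to a list of integer token ids."""
    return [stoi[c] for c in s]

def decode(ids):
    """Map token ids back to text, rendering the mask token as [MASK]."""
    return "".join("[MASK]" if i == MASK else itos[i] for i in ids)

vocab_size = len(chars) + 1            # characters + [MASK]
```

On the full tiny Shakespeare corpus, `len(chars)` would come out to 65, giving the 66-token vocabulary mentioned above.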

Despite only a few hours of training, the model already produces recognizable outputs. Given the prompt "to be," it generated: "To be, fo hend! First her sense ountier to Jupits, be horse." While far from coherent, the output clearly captures Shakespearean structure and rhythm. The developer acknowledges the model is undertrained but emphasizes the educational value—building it helped demystify concepts like discrete diffusion, encoders, decoders, and tokenizers. The full code is open-source on GitHub for anyone wanting to experiment.
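The discrete diffusion idea the post demystifies can be sketched in plain Python: the forward process replaces tokens with [MASK] at some noise rate, and generation runs the process in reverse, starting fully masked and revealing a growing fraction of positions each step. This is a generic sketch of masked discrete diffusion, not the repository's actual code; the denoiser here is a stand-in for the trained 7.5M-parameter network:

```python
import random

MASK = 65  # hypothetical id for the [MASK] token (last slot in a 66-token vocab)

def mask_tokens(ids, t, rng):
    """Forward process: each token is independently masked with probability t."""
    return [MASK if rng.random() < t else tok for tok in ids]

def generate(denoiser, length, steps, rng):
    """Reverse process sketch: start fully masked, reveal more tokens each step.

    `denoiser` is any callable returning a predicted token per position;
    in the real model it would be the trained network's argmax/sample.
    """
    ids = [MASK] * length
    for step in range(steps):
        preds = denoiser(ids)                     # predicted token per position
        target_revealed = int((step + 1) / steps * length)
        masked = [i for i, tok in enumerate(ids) if tok == MASK]
        rng.shuffle(masked)
        already_revealed = length - len(masked)
        for i in masked[: max(0, target_revealed - already_revealed)]:
            ids[i] = preds[i]
    return ids
```

Training would pair `mask_tokens` with a loss that asks the network to predict the original tokens at masked positions; after enough steps of `generate`, every position has been unmasked.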

Key Points
  • 7.5M parameter discrete diffusion language model built entirely without AI code generation tools
  • Trained for a few hours on a MacBook Air M2 using Karpathy's tiny Shakespeare dataset
  • Vocabulary of 66 tokens (65 characters + [MASK]), generating recognizable Shakespearean text from minimal training

Why It Matters

Proves diffusion language models are accessible to individual developers, not just big labs.