[P] Built GPT-2, Llama 3, and DeepSeek from scratch in PyTorch - open source code + book
Shows how to transform GPT-2 into Llama 3.2-3B by swapping just four key architectural components.
Developer S1LV3RJ1NX has released a comprehensive open-source educational project, including a book and accompanying code, that deconstructs and rebuilds leading large language models from the ground up. Titled 'Adventures with LLMs,' the resource provides a hands-on, PyTorch-based implementation of architectures like GPT-2, Meta's Llama 3, and DeepSeek. The core technical demonstration shows how the foundational GPT-2 architecture can be transformed into the more modern Llama 3.2-3B model through four precise swaps: replacing LayerNorm with RMSNorm, learned positional encodings with RoPE (Rotary Positional Embeddings), the GELU activation function with SwiGLU, and standard Multi-Head Attention with Grouped-Query Attention. Crucially, the code can then load Meta's official pre-trained weights, bridging the gap from educational implementation to a functional, state-of-the-art model.
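The first of the four swaps, replacing LayerNorm with RMSNorm, can be sketched in a few lines of PyTorch. This is an illustrative implementation of the general technique, not the project's exact code:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square normalization, as used in Llama.
    Unlike LayerNorm, it skips mean-centering and has no bias term."""
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale each feature vector by the reciprocal of its RMS,
        # then apply a learned per-dimension gain.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight
```

Because RMSNorm drops the mean subtraction and bias, it is slightly cheaper than LayerNorm while normalizing activations to roughly unit RMS.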
The project goes further by implementing DeepSeek's more sophisticated architecture, covering advanced features such as Multi-head Latent Attention (MLA) with the 'absorption trick,' decoupled RoPE, and a Mixture of Experts (MoE) system with shared experts and fine-grained expert segmentation. It also details implementation strategies for Multi-Token Prediction and FP8 quantization. By providing fully open-source code on GitHub and a book with a free sample, S1LV3RJ1NX offers a rare, practical look under the hood of these models. The resource moves beyond theoretical papers, giving engineers and students executable code to experiment with and learn from, demystifying the incremental architectural advances that differentiate today's top-performing LLMs.
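The DeepSeek-style MoE idea described above — a handful of always-active shared experts plus top-k routing over many small routed experts — can be sketched as a toy layer. All dimensions, expert counts, and layer shapes here are illustrative assumptions, not DeepSeek's actual configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_expert(dim: int) -> nn.Module:
    # A small feed-forward expert (illustrative sizing).
    return nn.Sequential(nn.Linear(dim, dim * 2), nn.SiLU(), nn.Linear(dim * 2, dim))

class MoELayer(nn.Module):
    """Toy Mixture of Experts with shared experts and fine-grained
    routed experts, in the spirit of DeepSeek's design."""
    def __init__(self, dim=64, n_routed=8, n_shared=1, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(dim, n_routed, bias=False)
        self.routed = nn.ModuleList(make_expert(dim) for _ in range(n_routed))
        self.shared = nn.ModuleList(make_expert(dim) for _ in range(n_shared))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, dim)
        # Shared experts process every token unconditionally.
        out = sum(e(x) for e in self.shared)
        # Router picks top-k routed experts per token, weighted by softmax score.
        scores = F.softmax(self.gate(x), dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1)
        for t in range(x.size(0)):  # naive per-token dispatch for clarity
            for w, i in zip(weights[t], idx[t]):
                out[t] = out[t] + w * self.routed[int(i)](x[t])
        return out
```

Real implementations batch tokens per expert instead of looping, but the routing logic — shared experts always on, routed experts gated per token — is the same.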
- Demonstrates converting GPT-2 to Llama 3.2-3B with just four component swaps: normalization (LayerNorm → RMSNorm), positional encoding (learned → RoPE), activation (GELU → SwiGLU), and attention (MHA → GQA).
- Implements DeepSeek's full architecture including its MoE system, Multi-Token Prediction, and FP8 quantization from scratch.
- Provides complete open-source PyTorch code and a book, offering a practical, code-first educational resource for AI practitioners.
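The attention swap in the list above — Grouped-Query Attention, where many query heads share a smaller set of key/value heads to shrink the KV cache — can be sketched as follows. Head counts and dimensions are illustrative, not Llama 3.2's actual settings:

```python
import torch
import torch.nn as nn

class GroupedQueryAttention(nn.Module):
    """Toy GQA: n_heads query heads share n_kv_heads key/value heads
    (n_kv_heads < n_heads), reducing KV-cache size vs. full MHA."""
    def __init__(self, dim=64, n_heads=8, n_kv_heads=2):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.wq = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, dim)
        b, s, _ = x.shape
        q = self.wq(x).view(b, s, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.wk(x).view(b, s, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.wv(x).view(b, s, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Repeat each KV head so every group of query heads has a matching one.
        rep = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(rep, dim=1)
        v = v.repeat_interleave(rep, dim=1)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim**0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, s, -1)
        return self.wo(out)
```

With 8 query heads and 2 KV heads, the KV cache is a quarter the size of standard multi-head attention while the query capacity is unchanged.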
Why It Matters
Demystifies cutting-edge AI model architectures for developers, enabling deeper understanding and fostering innovation through hands-on code.