Research & Papers

[R] Prisma: Interpretability-Inspired Mirrored Transformer Architecture

New 357M-parameter architecture achieves competitive results after seeing just 30B tokens.

Deep Dive

An independent researcher known as Prisma has unveiled a novel transformer architecture that challenges conventional scaling laws by achieving competitive performance with significantly less training data. The model, also called 'Prisma,' outperforms OpenAI's GPT-2 Medium on 5 out of 8 standard benchmarks despite being trained on only 30 billion tokens, a fraction of the data used by comparable models. The result suggests that architectural innovation, rather than sheer scale, can yield substantial efficiency gains. The work was inspired by interpretability research and a conceptual model of transformers as 'lenses with microfractures' that process information through mirrored pathways.

The architecture introduces two key innovations: G²LU (Gated² Linear Unit), a gated activation variant that adds a fourth weight set to the standard feed-forward network, and WoRPE (Word-Position RoPE), which enhances positional encoding. With 357 million parameters arranged across 41 layers, Prisma suggests that careful structural design can reduce computational demands. The model was trained on a mix of the OpenWebText and FineWeb-Edu datasets using a single H100 GPU, showing that competitive language modeling doesn't always require massive clusters. While still experimental, the approach could inspire more efficient model designs across the industry.
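
The write-up doesn't give the exact G²LU formulation, but a minimal sketch of one plausible reading follows: a feed-forward block where a second multiplicative gate is stacked on a standard gated linear unit, giving four weight matrices instead of the usual three. The class name, the SiLU/sigmoid gate choices, and the module layout are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of a "double-gated" (G²LU-style) feed-forward block.
# Assumption: the "fourth weight set" is a second gate projection on top of
# a standard gated linear unit; gate activations and names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DoubleGatedFFN(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_value = nn.Linear(d_model, d_ff, bias=False)  # value path
        self.w_gate1 = nn.Linear(d_model, d_ff, bias=False)  # first gate (as in SwiGLU)
        self.w_gate2 = nn.Linear(d_model, d_ff, bias=False)  # assumed second gate: the extra weight set
        self.w_out = nn.Linear(d_ff, d_model, bias=False)    # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Value path modulated by two multiplicative gates.
        hidden = self.w_value(x) * F.silu(self.w_gate1(x)) * torch.sigmoid(self.w_gate2(x))
        return self.w_out(hidden)
```

For comparison, a standard SwiGLU block uses only a value projection, one gate, and an output projection; the second gate above is just one way to read "a fourth weight set."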

Key Points
  • Beats GPT-2 Medium on 5/8 benchmarks with a fraction of the usual training data (30B tokens vs. the 200B+ typical of comparable models)
  • Introduces G²LU activation and WoRPE positional encoding in a 41-layer, 357M-parameter mirrored architecture
  • Trained on a single H100 GPU, demonstrating accessible pathways to competitive model development

Why It Matters

Shows architectural innovation can reduce AI training costs and energy use while maintaining performance, opening doors for smaller teams.