Open Source

Here's how my LLM's decoder block changed while training on 5B tokens

An experimental 18M-parameter model using spline manifolds shows promising loss reduction after 5B tokens of training.

Deep Dive

Independent researcher Curvedinf is conducting a novel experiment in AI architecture, replacing a core component of the standard transformer with a custom geometric design. The project, based on the K-Splanifolds paper, substitutes the traditional MLP (multilayer perceptron) sub-layers inside the decoder blocks with discrete, lower-dimensional spline manifolds. The 18-million-parameter model is being trained on a text corpus, and after processing 5 billion tokens it shows a steadily decreasing loss curve, suggesting the unconventional architecture can learn language patterns effectively. The researcher is visually tracking the development of individual layers, such as layer 96 of 128, to observe how the spline geometry evolves during training.
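
For orientation, here is a minimal, hypothetical sketch (in PyTorch, under my own assumptions) of the structural change being described: a standard pre-norm decoder block in which the usual MLP sub-layer is swapped for an injected spline-based module. The class and parameter names are placeholders, not taken from the K-Splanifolds code.

```python
# Illustrative only: a pre-norm decoder block where the usual MLP sub-layer
# is replaced by a pluggable spline-based module. "spline_layer" is a
# hypothetical stand-in, not the actual K-Splanifolds implementation.
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, spline_layer: nn.Module):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        # Where a standard block would use nn.Sequential(Linear, GELU, Linear),
        # this block delegates to the injected spline-based module instead.
        self.ffn = spline_layer

    def forward(self, x: torch.Tensor, attn_mask=None) -> torch.Tensor:
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + a                       # residual around attention
        x = x + self.ffn(self.ln2(x))   # residual around the spline module
        return x
```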

The work represents a significant departure from the dense, high-dimensional neural networks that dominate modern LLMs such as GPT-4 and Llama 3. By using spline manifolds (mathematical constructs that represent complex, curved surfaces with piecewise polynomials), the approach aims for a more efficient and potentially more interpretable geometric representation of data. The researcher has shared the source code along with interactive demos that let others visualize how these 2D and 3D splanifolds operate. Though still experimental, the project's early results challenge assumptions about how much scale and complexity a language model needs, opening a research path toward more efficient, geometrically inspired AI architectures that could reduce computational costs.
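
As a rough illustration of the piecewise-polynomial idea, here is a minimal per-feature piecewise-linear spline layer that could slot into the block sketched above. Knot placement, initialization, and the final linear mix are my own assumptions for the sketch, not the paper's construction.

```python
# Illustrative only: each feature passes through its own learnable
# piecewise-linear spline, then a linear layer mixes features. This stands
# in for the lower-dimensional spline manifolds the project describes.
import torch
import torch.nn as nn

class PiecewiseLinearSpline(nn.Module):
    def __init__(self, d_model: int, n_knots: int = 16,
                 x_min: float = -3.0, x_max: float = 3.0):
        super().__init__()
        self.x_min, self.n_knots = x_min, n_knots
        self.step = (x_max - x_min) / (n_knots - 1)
        # One learnable curve per feature, initialized to the identity line
        # so the layer starts out approximately as a pass-through.
        init = torch.linspace(x_min, x_max, n_knots).repeat(d_model, 1)
        self.values = nn.Parameter(init)          # (d_model, n_knots)
        self.mix = nn.Linear(d_model, d_model)    # mixes features after the splines

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., d_model). Clamp into the knot range, locate the segment,
        # and linearly interpolate between the two surrounding knot values.
        xc = x.clamp(self.x_min, self.x_min + self.step * (self.n_knots - 1))
        pos = (xc - self.x_min) / self.step
        idx = pos.floor().long().clamp(0, self.n_knots - 2)
        t = pos - idx                              # position within the segment, in [0, 1]
        feat = torch.arange(x.shape[-1], device=x.device)
        left = self.values[feat, idx]              # knot value at the segment start
        right = self.values[feat, idx + 1]         # knot value at the segment end
        return self.mix(left + t * (right - left))
```

A block could then be assembled as, say, `DecoderBlock(d_model=256, n_heads=4, spline_layer=PiecewiseLinearSpline(256))`; the point is only to show where a spline module would sit relative to attention, not how K-Splanifolds actually parameterizes its manifolds.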

Key Points
  • Architecture replaces the MLP sub-layers of transformer decoder blocks with discrete spline manifolds, a major structural change.
  • The compact 18M parameter model shows decreasing loss after training on 5B tokens.
  • Researcher is visually monitoring the geometric development of layer 96/128 during ongoing training.

Why It Matters

Challenges transformer orthodoxy with efficient geometric representations, potentially leading to smaller, faster, and more interpretable future models.