Research & Papers

Nautile-370M: Spectral Memory Meets Attention in a Small Reasoning Model

A 371M-parameter model uses spectral memory to rival attention's expressiveness...

Deep Dive

Maixent Chenebaux introduced Nautile-370M, a 371-million-parameter small language model that combines a novel spectral memory operator, SeqCond Attention (SCA), with standard transformer layers. The architecture alternates two SCA layers, which run in linear time and are inspired by SeqCondenser, with one transformer layer, aiming to capture long-context efficiency and state tracking while preserving the token-to-token routing of softmax attention. The model was trained on a single Google Cloud TPU v4-64 pod via the TPU Research Cloud program, with a reinforcement learning stage run on an NVIDIA DGX Spark.
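
To make the 2:1 layer alternation concrete, here is a minimal sketch under stated assumptions. The paper's actual SCA operator is not reproduced; `PrefixMemoryLayer` and `HybridStack` are hypothetical stand-ins that only mirror the general shape of the design (a linear-time "prefix summary plus readout" layer, stacked two-to-one with a standard softmax-attention layer), and details such as the spectral parameterization, normalization, causal masking, and feed-forward sublayers are omitted.

```python
import torch
import torch.nn as nn


class PrefixMemoryLayer(nn.Module):
    """Hypothetical stand-in for an SCA-style linear-time layer: each token
    writes a key/value outer product into a running prefix summary, and the
    output is a query-conditioned readout of that summary."""

    def __init__(self, d_model: int):
        super().__init__()
        self.key = nn.Linear(d_model, d_model, bias=False)
        self.value = nn.Linear(d_model, d_model, bias=False)
        self.query = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, d)
        k = torch.softmax(self.key(x), dim=-1)            # non-negative write weights
        v, q = self.value(x), self.query(x)
        # Running summary S_t = sum_{i<=t} k_i v_i^T; the cumsum keeps every
        # prefix state for clarity, a recurrent form would keep only the latest.
        summary = torch.cumsum(k.unsqueeze(-1) * v.unsqueeze(-2), dim=1)
        return x + torch.einsum("btd,btde->bte", q, summary)  # residual readout


class HybridStack(nn.Module):
    """The 2:1 alternation described above: two linear-time memory layers,
    then one softmax-attention transformer layer, repeated (causal masking
    omitted for brevity)."""

    def __init__(self, d_model: int = 256, n_heads: int = 8, n_groups: int = 4):
        super().__init__()
        layers = []
        for _ in range(n_groups):
            layers += [
                PrefixMemoryLayer(d_model),
                PrefixMemoryLayer(d_model),
                nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            ]
        self.layers = nn.ModuleList(layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x)
        return x


if __name__ == "__main__":
    tokens = torch.randn(2, 64, 256)           # (batch, seq_len, d_model)
    print(HybridStack()(tokens).shape)          # torch.Size([2, 64, 256])
```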

Key theoretical contributions include a proof that SCA's readout mechanism can exactly retrieve any individual token from the prefix summary and can reproduce any output of softmax attention as a special case, establishing that SCA matches the expressiveness of full self-attention in the continuous limit. The paper also details a training data pipeline and a specialized RL stage targeting reasoning, verification, and response quality. Nautile-370M demonstrates that small, efficient models can match attention's capabilities using spectral memory, opening the door to reasoning under strict parameter and inference budgets.
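
The exact-retrieval claim concerns reading a single token back out of a compressed prefix summary. The toy sketch below is not the paper's construction: it assumes hypothetical orthonormal per-position "address" vectors and a plain outer-product summary, and exact recovery here relies on the address dimension matching the sequence length, loosely analogous to the continuous-limit caveat above. It only illustrates why a summary-plus-readout layer can, in principle, return any individual token exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 16, 32

values = rng.standard_normal((seq_len, d_model))        # token representations v_i
# Orthonormal address vectors a_i (rows of a random orthogonal matrix).
addresses, _ = np.linalg.qr(rng.standard_normal((seq_len, seq_len)))

# Prefix summary S = sum_i a_i v_i^T, built in a single pass over the sequence.
summary = addresses.T @ values                           # shape (seq_len, d_model)

# Readout keyed on position j: a_j^T S = sum_i (a_j . a_i) v_i = v_j exactly,
# because the addresses are mutually orthonormal.
j = 7
recovered = addresses[j] @ summary
print(np.allclose(recovered, values[j]))                 # True
```

With non-orthogonal or lower-dimensional addresses the readout becomes a weighted mixture of tokens rather than an exact copy, which is why the equivalence to softmax attention is stated in a limiting regime.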

Key Points
  • Nautile-370M uses a hybrid of two SeqCond Attention (SCA) layers and one transformer layer for efficient reasoning.
  • Proved SCA can exactly retrieve any token from the prefix summary and reproduce softmax attention as a special case.
  • Trained on a single Google TPU v4-64 pod, with RL on an NVIDIA DGX Spark for reasoning and verification.

Why It Matters

Shows small models can match attention's power with spectral memory, enabling efficient reasoning on limited hardware.