Research & Papers

Nautile-370M: Spectral Memory Meets Attention in a Small Reasoning Model

A 371M-parameter model uses spectral memory to rival attention's expressiveness...

Deep Dive

Maixent Chenebaux introduced Nautile-370M, a 371-million-parameter small language model that combines a novel spectral memory operator, SeqCond Attention (SCA), with standard transformer layers. The architecture alternates two SCA layers, which run in linear time and are inspired by SeqCondenser, with one transformer layer, aiming to capture long-context efficiency and state tracking while preserving the token-to-token routing of softmax attention. The model was trained on a single Google Cloud TPU v4-64 pod via the TPU Research Cloud program, with a reinforcement learning stage run on an NVIDIA DGX Spark.
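
To make the 2:1 layer alternation concrete, here is a minimal sketch under stated assumptions. The paper's actual SCA operator is not reproduced; `PrefixMemoryLayer` and `HybridStack` are hypothetical stand-ins that only mirror the general shape of the design (a linear-time "prefix summary plus readout" layer, stacked two-to-one with a standard softmax-attention layer), and details such as the spectral parameterization, normalization, causal masking, and feed-forward sublayers are omitted.

```python
import torch
import torch.nn as nn


class PrefixMemoryLayer(nn.Module):
    """Hypothetical stand-in for an SCA-style linear-time layer: each token
    writes a key/value outer product into a running prefix summary, and the
    output is a query-conditioned readout of that summary."""

    def __init__(self, d_model: int):
        super().__init__()
        self.key = nn.Linear(d_model, d_model, bias=False)
        self.value = nn.Linear(d_model, d_model, bias=False)
        self.query = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, d)
        k = torch.softmax(self.key(x), dim=-1)            # non-negative write weights
        v, q = self.value(x), self.query(x)
        # Running summary S_t = sum_{i<=t} k_i v_i^T; the cumsum keeps every
        # prefix state for clarity, a recurrent form would keep only the latest.
        summary = torch.cumsum(k.unsqueeze(-1) * v.unsqueeze(-2), dim=1)
        return x + torch.einsum("btd,btde->bte", q, summary)  # residual readout


class HybridStack(nn.Module):
    """The 2:1 alternation described above: two linear-time memory layers,
    then one softmax-attention transformer layer, repeated (causal masking
    omitted for brevity)."""

    def __init__(self, d_model: int = 256, n_heads: int = 8, n_groups: int = 4):
        super().__init__()
        layers = []
        for _ in range(n_groups):
            layers += [
                PrefixMemoryLayer(d_model),
                PrefixMemoryLayer(d_model),
                nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            ]
        self.layers = nn.ModuleList(layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x)
        return x


if __name__ == "__main__":
    tokens = torch.randn(2, 64, 256)           # (batch, seq_len, d_model)
    print(HybridStack()(tokens).shape)          # torch.Size([2, 64, 256])
```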

Key theoretical contributions include a proof that SCA's readout mechanism can exactly retrieve any individual token from the prefix summary and can reproduce any output of softmax attention as a special case, establishing that SCA matches the expressiveness of full self-attention in the continuous limit. The paper also details a training data pipeline and a specialized RL stage targeting reasoning, verification, and response quality. Nautile-370M demonstrates that small, efficient models can match attention's capabilities using spectral memory, opening the door to reasoning under strict parameter and inference budgets.
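
The exact-retrieval claim concerns reading a single token back out of a compressed prefix summary. The toy sketch below is not the paper's construction: it assumes hypothetical orthonormal per-position "address" vectors and a plain outer-product summary, and exact recovery here relies on the address dimension matching the sequence length, loosely analogous to the continuous-limit caveat above. It only illustrates why a summary-plus-readout layer can, in principle, return any individual token exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 16, 32

values = rng.standard_normal((seq_len, d_model))        # token representations v_i
# Orthonormal address vectors a_i (rows of a random orthogonal matrix).
addresses, _ = np.linalg.qr(rng.standard_normal((seq_len, seq_len)))

# Prefix summary S = sum_i a_i v_i^T, built in a single pass over the sequence.
summary = addresses.T @ values                           # shape (seq_len, d_model)

# Readout keyed on position j: a_j^T S = sum_i (a_j . a_i) v_i = v_j exactly,
# because the addresses are mutually orthonormal.
j = 7
recovered = addresses[j] @ summary
print(np.allclose(recovered, values[j]))                 # True
```

With non-orthogonal or lower-dimensional addresses the readout becomes a weighted mixture of tokens rather than an exact copy, which is why the equivalence to softmax attention is stated in a limiting regime.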

Key Points
  • Nautile-370M uses a hybrid of two SeqCond Attention (SCA) layers and one transformer layer for efficient reasoning.
  • Proved SCA can exactly retrieve any token from the prefix summary and reproduce softmax attention as a special case.
  • Trained on a single Google TPU v4-64 pod, with RL on an NVIDIA DGX Spark for reasoning and verification.

Why It Matters

Shows small models can match attention's power with spectral memory, enabling efficient reasoning on limited hardware.