Hit 90.4% on LongMemEval-S with structured storage - no embeddings, ~half the tokens, 98% retrieval accuracy
A solo developer known as MontyOW has achieved a 90.4% score on the LongMemEval-S benchmark with c137, a structured memory system that deliberately avoids embeddings. The system uses a fixed 3-stage pipeline: retrieve, answer, and store. Stages 1 and 3 maintain lean maps of existing memory (topics, facts, ledgers), while Stage 2 processes only the relevant slice. This approach uses a median of 15k tokens per question (3k cached system prompt, 2k user model, 8k dynamic, 2k tail), with no embeddings anywhere.
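The retrieve/answer/store loop over a topic-keyed memory map can be sketched as below. This is a minimal illustration of the general idea, not c137's actual schema or code: the names (`StructuredMemory`, `store`, `retrieve`, `answer`) and the keyword-match retrieval are assumptions, and the real Stage 2 would call an LLM rather than echo the context.

```python
# Hedged sketch of a 3-stage retrieve/answer/store pipeline over
# structured (embedding-free) memory. All names are illustrative;
# they are not taken from the c137 codebase.
from dataclasses import dataclass, field


@dataclass
class StructuredMemory:
    # topic -> list of plain-text facts; no vectors anywhere
    topics: dict = field(default_factory=dict)

    def store(self, topic: str, fact: str) -> None:
        # Stage 3: file each new fact under its topic so that later
        # retrieval is a single keyed lookup (the "1-hop" idea)
        self.topics.setdefault(topic, []).append(fact)

    def retrieve(self, question: str) -> list:
        # Stage 1: scan the small topic map and pull only the slice
        # whose topic name appears in the question
        hits = []
        for topic, facts in self.topics.items():
            if topic.lower() in question.lower():
                hits.extend(facts)
        return hits


def answer(question: str, context: list) -> list:
    # Stage 2: in a real system an LLM answers from the retrieved
    # slice; here we return the slice to show the data flow
    return context


mem = StructuredMemory()
mem.store("allergies", "User is allergic to peanuts.")
mem.store("travel", "User visited Kyoto in 2023.")

ctx = mem.retrieve("What are my allergies?")
print(answer("What are my allergies?", ctx))
```

Because storage already organizes facts by topic, retrieval never needs similarity search: it only has to look up the topics the question mentions, which is why the heavy lifting happens at write time.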
The developer, who built this during their first year of university, started with embeddings and centroid clustering but found the result felt too much like a search engine, while agentic approaches with tool calling proved unreliable with weaker models. The key insight: if you store correctly, retrieval becomes a 1-hop problem. Of the 500 questions, 10 lacked the context needed to answer, and the remaining failures came from the model misusing context it had retrieved. The project includes a public bench viewer (c137.ai/research/benchmark) showing all 500 questions sorted by category, with pass/fail status, ground truth, and failures bucketed into model-fails vs retrieval-fails.
- 90.4% on LongMemEval-S with 98% retrieval accuracy using structured storage
- No embeddings used; 3-stage pipeline (retrieve, answer, store) with median 15k tokens per question
- Public bench viewer with 500 questions, pass/fail breakdown, and failure categorization
Why It Matters
Demonstrates that structured memory can outperform embedding-based retrieval on long-context AI tasks at roughly half the token cost