Open Source

Intel Optane PMem build runs 1-trillion-parameter Kimi K2.5 at ~4 tokens/sec

A single RTX 3060 + 768GB of secondhand Optane memory runs a trillion-parameter MoE model.

Deep Dive

A creative local AI inference build uses Intel's discontinued Optane Persistent Memory (PMem) to run the 1-trillion-parameter Kimi K2.5 model at a usable ~4 tokens per second. The system has six 128GB Optane DIMMs (768GB total) running in Memory Mode, in which PMem serves as main memory and conventional DDR4 DRAM acts as a cache in front of it. Because Intel has discontinued the product line, the builder bought the Optane modules secondhand, which cuts memory cost sharply compared with the same capacity in DRAM. The CPU is an Intel Xeon Gold 6246. An ASUS RTX 3060 (12GB) handles attention and the dense layers, while the bulk of the model's sparse expert weights reside in the PMem/DRAM tier.
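The capacity arithmetic behind this setup can be sketched in a few lines of Python. The DIMM count, DIMM size, and parameter count come from the article; the quantization levels are illustrative assumptions, since the source does not say which quantization the builder used, and the estimate ignores KV cache and file metadata.

```python
# Back-of-envelope capacity check for the build described above.
# Quantization levels are hypothetical, not the builder's settings.

GB = 10**9  # decimal gigabytes, good enough for a rough estimate


def model_size_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (weights only)."""
    return total_params * bits_per_weight / 8 / GB


def fits(total_params: float, bits_per_weight: float, capacity_gb: float) -> bool:
    """True if the weights alone fit in the given capacity."""
    return model_size_gb(total_params, bits_per_weight) <= capacity_gb


# In Memory Mode the DRAM is a cache, so it adds no addressable capacity.
PMEM_CAPACITY_GB = 6 * 128  # six 128GB Optane DIMMs (from the article)
TOTAL_PARAMS = 1e12         # 1 trillion parameters (from the article)

for bits in (4, 6, 8):  # assumed quantization levels
    size = model_size_gb(TOTAL_PARAMS, bits)
    ok = fits(TOTAL_PARAMS, bits, PMEM_CAPACITY_GB)
    print(f"{bits}-bit weights: ~{size:.0f} GB, fits in {PMEM_CAPACITY_GB} GB: {ok}")
```

Under these assumptions, a trillion-parameter model only fits in 768GB when quantized well below 8 bits per weight, which is why large-capacity memory matters more here than GPU memory.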

Using llama.cpp with hybrid GPU/CPU inference and flags such as --override-tensor and --cpu-moe (short form -cmoe), the builder achieves ~4 tokens per second for generation. Kimi K2.5 is a mixture-of-experts (MoE) model, which suits this setup well: only a fraction of the 1T total parameters are active for any given token, so the 12GB GPU can handle routing, attention, and shared experts while the sparse experts are computed on the CPU from weights held in PMem. The build demonstrates that with creative memory tiering and cheap secondhand hardware, even frontier-class models can be run locally. The builder notes that Optane's performance characteristics (faster than SSDs but slower than DRAM) make it uniquely suited to this inference pattern, and that its discontinuation is a loss for the open-source AI community.
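The placement idea can be illustrated with a small Python sketch: match tensor names against patterns and pin the routed experts to the CPU, leaving everything else on the GPU. This is not llama.cpp's implementation. The rule table, the `place` helper, and the example tensor names (modeled on common GGUF naming) are assumptions for illustration only.

```python
import re

# Hypothetical placement rules in the spirit of pattern-based tensor
# overrides: the first matching pattern wins, anything else goes to the GPU.
RULES = [
    (re.compile(r"\.ffn_(gate|up|down)_exps\."), "CPU"),  # routed (sparse) experts
]
DEFAULT_DEVICE = "GPU"


def place(tensor_name: str) -> str:
    """Return the device a tensor would be assigned to under RULES."""
    for pattern, device in RULES:
        if pattern.search(tensor_name):
            return device
    return DEFAULT_DEVICE


# Illustrative tensor names (assumed, not read from a real model file).
names = [
    "blk.3.attn_output.weight",    # attention
    "blk.3.ffn_gate_inp.weight",   # router
    "blk.3.ffn_up_shexp.weight",   # shared expert
    "blk.3.ffn_gate_exps.weight",  # routed experts
    "blk.3.ffn_down_exps.weight",  # routed experts
]

placement = {name: place(name) for name in names}
for name, device in placement.items():
    print(f"{device:>3}  {name}")
```

The design choice is that the small, always-used tensors (attention, router, shared experts) stay in fast GPU memory, while the large, sparsely used expert tensors live in the high-capacity host tier.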

Key Points
  • Uses Intel Optane PMem (768GB) as main memory with a DRAM cache; bought secondhand, it costs far less than equivalent DRAM capacity.
  • Runs Kimi K2.5 (1T parameters, MoE) via llama.cpp with hybrid CPU/GPU offloading: attention on the RTX 3060, sparse experts on PMem.
  • Achieves ~4 tokens/sec generation, a usable rate for local inference of a trillion-parameter model on budget hardware.
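For a rough sense of what ~4 tokens/sec demands from the memory tier, the Python sketch below estimates the host-memory read bandwidth implied by a decode rate. It assumes each generated token reads every active host-side weight once. The active parameter count, bits per weight, and host-side fraction are hypothetical placeholders: the source says only that "a fraction" of the parameters are active per token.

```python
def implied_bandwidth_gbs(
    tokens_per_sec: float,
    active_params: float,
    bits_per_weight: float,
    host_fraction: float,
) -> float:
    """Host read bandwidth in GB/s implied by a decode rate.

    Assumes every active weight held in host memory is read once per token.
    """
    bytes_per_token = active_params * host_fraction * bits_per_weight / 8
    return tokens_per_sec * bytes_per_token / 1e9


# Only the token rate comes from the article; the rest are assumptions.
TOKENS_PER_SEC = 4.0    # reported generation speed
ACTIVE_PARAMS = 32e9    # hypothetical active parameters per token
BITS_PER_WEIGHT = 4.0   # hypothetical quantization
HOST_FRACTION = 0.75    # hypothetical share of active weights on the host

estimate = implied_bandwidth_gbs(
    TOKENS_PER_SEC, ACTIVE_PARAMS, BITS_PER_WEIGHT, HOST_FRACTION
)
print(f"Implied host read bandwidth: ~{estimate:.0f} GB/s")
```

Changing any of the assumed values shifts the estimate proportionally, so the result is an order-of-magnitude guide rather than a measurement of this build.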

Why It Matters

Shows that cheap, discontinued memory technology can help democratize local access to frontier-scale LLMs.