Kimi K2.5 model runs on RTX 3060 with Intel Optane
1 trillion parameters on a $300 GPU, at about 4 tokens per second.
AI enthusiast APFrisco has demonstrated Moonshot AI's 1-trillion-parameter Kimi K2.5 model running on a single consumer-grade Nvidia RTX 3060 graphics card. The key enabler was 768GB of Intel Optane Persistent Memory, which provided the capacity to hold the model's weights that the RTX 3060's 12GB of VRAM cannot. The setup achieved local inference at approximately four tokens per second, a notable result for non-enterprise hardware.
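To make the capacity argument concrete, here is a rough back-of-the-envelope sketch of how much memory 1 trillion weights would need at a few common precisions. The bit widths are illustrative assumptions, since the report does not say which precision or quantization the demo used, and the figures count raw weights only (no KV cache, activations, or runtime overhead).

```python
# Back-of-the-envelope check: would 1 trillion weights fit in 768 GB?
# The bit widths below are illustrative assumptions; the article does not
# state which precision or quantization the demo actually used.

PARAMS = 1_000_000_000_000   # 1 trillion parameters (from the article)
OPTANE_GB = 768              # Optane Persistent Memory capacity (from the article)
VRAM_GB = 12                 # RTX 3060 VRAM (from the article)


def weights_gb(params: int, bits_per_weight: float) -> float:
    """Size of the raw weights in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


def footprint_table(bit_widths=(16, 8, 4)):
    """Return (bits, size_gb, fits_in_optane, fits_in_vram) for each width."""
    rows = []
    for bits in bit_widths:
        size = weights_gb(PARAMS, bits)
        rows.append((bits, size, size <= OPTANE_GB, size <= VRAM_GB))
    return rows


if __name__ == "__main__":
    for bits, size, fits_optane, fits_vram in footprint_table():
        print(
            f"{bits:>2}-bit: {size:6.0f} GB | "
            f"fits in {OPTANE_GB} GB: {fits_optane} | "
            f"fits in {VRAM_GB} GB VRAM: {fits_vram}"
        )
```

Under these assumptions, the raw weights would come to roughly 2,000 GB at 16 bits, 1,000 GB at 8 bits, and 500 GB at 4 bits. Only the 4-bit case fits in 768GB, and none comes close to fitting in 12GB of VRAM, which is why a large memory tier outside the GPU is essential to this kind of setup.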
This feat highlights the potential of persistent memory to bridge the gap between consumer and enterprise AI workloads. While 4 tokens per second is too slow for real-time use, it's fast enough for offline experimentation, model evaluation, and similar non-interactive work. The approach could widen access to trillion-parameter models, which are typically run on multi-GPU servers built around data-center accelerators such as the H100. APFrisco's work suggests a path for hobbyists and startups to explore frontier-scale models without a data-center budget.
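For a sense of what about 4 tokens per second means in practice, the sketch below converts a response length into wall-clock generation time. The token counts are illustrative, the helper names are hypothetical, and the estimate ignores prompt processing time, which the report does not quantify.

```python
# Rough wall-clock estimate for generating a response at a fixed rate.
# Only the ~4 tokens/second figure comes from the article; the response
# lengths are illustrative, and prompt processing time is ignored.

TOKENS_PER_SECOND = 4.0  # approximate rate reported in the article


def generation_seconds(num_tokens: int,
                       tokens_per_second: float = TOKENS_PER_SECOND) -> float:
    """Seconds needed to generate num_tokens at a constant rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


def format_duration(seconds: float) -> str:
    """Render a duration as 'M min SS s'."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes} min {secs:02d} s"


if __name__ == "__main__":
    for tokens in (100, 500, 2000):
        print(f"{tokens:>5} tokens: {format_duration(generation_seconds(tokens))}")
```

At this rate a 500-token answer takes a little over two minutes and a 2,000-token answer more than eight, which is workable for batch evaluation but not for interactive chat.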
- Ran Moonshot AI's 1-trillion-parameter Kimi K2.5 model on a single RTX 3060 GPU
- Used 768GB of Intel Optane Persistent Memory for weight storage
- Achieved ~4 tokens per second inference speed on consumer hardware
Why It Matters
Shows that trillion-parameter inference is possible on a consumer GPU when paired with large persistent memory, which could substantially lower the barrier to AI research.