Open Source

I built a hybrid MoE runtime that does 3,324 tok/s prefill on a single 5080. Here are the benchmarks.

New hybrid CPU/GPU system runs 80B-parameter models on consumer hardware with breakthrough prefill speeds.

Deep Dive

A developer has unveiled Krasis, a novel hybrid CPU/GPU runtime engineered specifically for large Mixture-of-Experts models. The system rethinks how inference is handled by dedicating the GPU's full power to the computationally intensive prefill phase (processing the initial prompt) while offloading the subsequent token generation (decode) to the CPU. This architecture, combined with aggressive use of system RAM as a staging area, allows models far exceeding GPU VRAM capacity to run at practical speeds. The benchmark results are striking: on a single consumer-grade RTX 5080, Krasis processed the 80-billion-parameter Qwen3-Coder-Next model at 3,324 tokens per second during prefill, with a time-to-first-token of just 9.7 seconds for a 35K-token context.
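
To make the division of labor concrete, here is a minimal sketch of the prefill-on-GPU / decode-on-CPU split, written against PyTorch. It is not Krasis code: the toy layer stack, dimensions, and helper names (prefill_on_gpu, decode_on_cpu) are illustrative assumptions.

```python
# Minimal sketch of the prefill/decode split described above (illustrative only;
# the toy "model" below stands in for a real MoE transformer).
import torch

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Stand-in for a stack of transformer layers whose weights live in system RAM.
layers = [torch.nn.Linear(512, 512) for _ in range(8)]  # resident on CPU

def prefill_on_gpu(prompt_embeddings: torch.Tensor) -> torch.Tensor:
    """Prefill: stream each layer's weights to the GPU, run the whole prompt
    through it in one large batched pass, then evict the weights again."""
    hidden = prompt_embeddings.to(DEVICE)
    for layer in layers:
        gpu_layer = layer.to(DEVICE)        # copy weights host -> VRAM
        hidden = torch.relu(gpu_layer(hidden))
        layer.to("cpu")                     # free VRAM for the next layer
    return hidden.to("cpu")                 # results stay in system RAM

def decode_on_cpu(hidden: torch.Tensor, steps: int = 4) -> torch.Tensor:
    """Decode: one token at a time, entirely on CPU, reading weights from RAM."""
    token_state = hidden[-1:]               # last position seeds generation
    for _ in range(steps):
        for layer in layers:
            token_state = torch.relu(layer(token_state))
    return token_state

prompt = torch.randn(2048, 512)             # pretend 2K-token prompt, dim 512
kv = prefill_on_gpu(prompt)
out = decode_on_cpu(kv)
print(out.shape)
```

The reason the split pays off is that prefill is one large batched pass over the whole prompt, which keeps the GPU busy even while weights are being shuttled in, whereas decode touches a single token per step and is largely memory-bound, so running it against weights already resident in system RAM is comparatively cheap.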

The technical breakthrough lies in treating the GPU as a streaming compute engine: model weights are pushed through VRAM as fast as possible, and data-transfer latency is hidden under concurrent computation. This stands in contrast to standard runtimes that offload only a few layers to the GPU, leaving most of the model to a 'slow CPU slog'. The trade-off is significantly higher system RAM usage (about 2.5x the quantized model size), but this is a favorable exchange given RAM's much lower cost compared to VRAM. Currently supporting NVIDIA cards and models such as Qwen3-Coder-Next and DeepSeek V2-Lite, Krasis demonstrates that specialized runtimes can dramatically improve access to massive, state-of-the-art AI models without requiring datacenter-grade hardware.
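
The copy/compute overlap can be sketched with a dedicated CUDA copy stream and pinned host memory, as below. This is an assumed PyTorch-style illustration of the general double-buffering idea, not an excerpt from Krasis; the layer count, sizes, and prefetch helper are invented.

```python
# Sketch of the "streaming compute" idea: prefetch the next layer's weights over
# PCIe on a side stream while the current layer computes, so copy latency is
# hidden behind the matmuls.
import torch

assert torch.cuda.is_available(), "this sketch needs a CUDA device"

n_layers, dim = 16, 4096
# Host-side weights in pinned (page-locked) RAM so async H2D copies are possible.
host_weights = [torch.randn(dim, dim).pin_memory() for _ in range(n_layers)]

copy_stream = torch.cuda.Stream()            # dedicated transfer stream
compute_stream = torch.cuda.current_stream()

def prefetch(i: int) -> torch.Tensor:
    """Start an async host->device copy of layer i's weights on the copy stream."""
    with torch.cuda.stream(copy_stream):
        return host_weights[i].to("cuda", non_blocking=True)

x = torch.randn(8192, dim, device="cuda")    # pretend prefill activations
next_w = prefetch(0)
for i in range(n_layers):
    compute_stream.wait_stream(copy_stream)  # weights for layer i have landed
    w = next_w
    w.record_stream(compute_stream)          # tell the allocator who uses this copy
    if i + 1 < n_layers:
        next_w = prefetch(i + 1)             # overlap next copy with this matmul
    x = torch.relu(x @ w.t())                # "layer i" compute on the default stream
torch.cuda.synchronize()
print(x.shape)
```

With this pattern the host-to-device copy of layer i+1 runs while layer i's matmuls execute, so the GPU only stalls if a layer's transfer takes longer than its compute.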

Key Points
  • Achieves 3,324 tokens/second prefill for the 80B Qwen3-Coder-Next model on a single 16 GB RTX 5080
  • Uses a novel split architecture: the GPU handles the full prefill pass, the CPU handles decode, and system RAM serves as a high-bandwidth staging area
  • Enables running models 2-5x larger than available VRAM by trading expensive VRAM for cheaper, more abundant system RAM (see the back-of-the-envelope sketch below)
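
For a rough sense of the RAM trade-off in the last bullet, a back-of-the-envelope sketch; the ~4-bit quantization and bytes-per-parameter figures below are assumptions, not reported numbers.

```python
# Rough RAM estimate behind the ~2.5x figure quoted above (assumed sizes).
params = 80e9                                # 80B-parameter model
bytes_per_param = 0.5                        # ~4-bit quant, ignoring overhead
quantized_gb = params * bytes_per_param / 1e9
staging_gb = 2.5 * quantized_gb              # reported RAM multiplier
print(f"quantized weights ~ {quantized_gb:.0f} GB, system RAM needed ~ {staging_gb:.0f} GB")
```

Under those assumptions the runtime would want on the order of 100 GB of system RAM, which a high-end consumer desktop can provide, while the quantized weights alone would far exceed a 16 GB card's VRAM.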

Why It Matters

Democratizes access to massive coding/assistant models by making them runnable on consumer hardware, bypassing VRAM limitations.