Open Source

PFlash: 10x prefill speedup over llama.cpp at 128K on an RTX 3090

Speculative prefill cuts first-token wait from 4 minutes to 25 seconds on a consumer GPU.

Deep Dive

Luce-Org released an open-source C++/CUDA speculative prefill system (repo: github.com/Luce-Org/lucebox-hub) that speeds up long-context LLM prefill by roughly 10x. On an RTX 3090 running Qwen3.6-27B with a 128K-token prompt, time-to-first-token dropped from ~257 s to 24.8 s. The approach uses a small drafter model to score token importance, then prefills only the spans it deems crucial. The inference path contains no Python or PyTorch. The code is available on GitHub under the MIT license.

Key Points
  • 10.4x faster TTFT than vanilla llama.cpp at 128K tokens on RTX 3090 (24.8s vs 257s)
  • Open source MIT license on GitHub (Luce-Org/lucebox-hub), C++/CUDA only
  • Combines speculative prefill and block-sparse attention in a single process without Python or PyTorch

Why It Matters

This makes long-context LLM inference practical on consumer GPUs, opening the door to responsive applications over very large prompts without datacenter hardware.