Open Source

PFlash: 10x prefill speedup over llama.cpp at 128K on an RTX 3090

Speculative prefill cuts first-token wait from 4 minutes to 25 seconds on a consumer GPU.

Deep Dive

Luce-Org released an open-source C++/CUDA speculative prefill system (repo: github.com/Luce-Org/lucebox-hub) that speeds up long-context LLM prefill by roughly 10x. On an RTX 3090 running Qwen3.6-27B with a 128K-token prompt, time-to-first-token dropped from ~257 s to 24.8 s. The approach uses a small drafter model to score token importance, then prefills only the spans it deems crucial. The inference path contains no Python or PyTorch. The code is available on GitHub under the MIT license.

Key Points
  • 10.4x faster TTFT than vanilla llama.cpp at 128K tokens on RTX 3090 (24.8s vs 257s)
  • Open source MIT license on GitHub (Luce-Org/lucebox-hub), C++/CUDA only
  • Combines speculative prefill and block-sparse attention in a single process without Python or PyTorch

Why It Matters

This makes long-context LLM inference practical on consumer GPUs, opening the door to responsive applications over very large prompts without datacenter hardware.