AdaLLM unlocks NVFP4 on RTX 4090 with 2.4x lower VRAM and FP8 kernels
A new runtime makes NVIDIA's latest 4-bit format usable on consumer GPUs today.
AdaLLM is a specialized inference runtime that lets NVFP4 4-bit weights run on Ada Lovelace GPUs such as the RTX 4090. Its pipeline stays in FP8 end to end, with a custom decode kernel and an FP8 KV cache, rather than silently falling back to FP16. In benchmarks, a Qwen3-8B model uses 2.4x less peak VRAM than an FP16 baseline at the cost of 20-25% lower throughput, and reaches up to ~470 tokens/sec at batch size 16.
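To put the 2.4x peak-VRAM figure in context, the sketch below estimates weight storage alone for a nominal 8-billion-parameter model. It is a back-of-the-envelope calculation, not AdaLLM code. It assumes NVFP4 stores each weight in 4 bits plus one 8-bit scale per block of 16 weights (about 4.5 bits per weight) and rounds the parameter count to 8 billion. The function and constant names are hypothetical.

```python
def weight_gigabytes(num_params: int, bits_per_weight: float) -> float:
    """Storage needed for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: nominal parameter count; the real Qwen3-8B count differs slightly.
NUM_PARAMS = 8_000_000_000

FP16_BITS = 16.0
# Assumption: 4-bit values plus one 8-bit scale shared by each 16-weight block.
NVFP4_BITS = 4.0 + 8.0 / 16.0

fp16_gb = weight_gigabytes(NUM_PARAMS, FP16_BITS)
nvfp4_gb = weight_gigabytes(NUM_PARAMS, NVFP4_BITS)
weight_only_ratio = fp16_gb / nvfp4_gb

print(f"FP16 weights:  {fp16_gb:.1f} GB")
print(f"NVFP4 weights: {nvfp4_gb:.1f} GB")
print(f"Weight-only reduction: {weight_only_ratio:.2f}x")
```

Under these assumptions the weight-only reduction comes out larger than the reported 2.4x. That is expected if peak VRAM also covers the KV cache, activations, and any layers kept at higher precision, though the source does not break the figure down.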
Why It Matters
By cutting peak VRAM, AdaLLM lets developers run larger, more capable models locally on high-end consumer GPUs such as the RTX 4090, in exchange for some throughput.