Open Source

[Release] AdaLLM: NVFP4-first inference on RTX 4090 (FP8 KV cache + custom FP8 decode)

A new runtime makes NVIDIA's NVFP4 4-bit format usable on consumer GPUs today.

Deep Dive

AdaLLM is a specialized inference runtime that runs NVFP4 4-bit weights on Ada Lovelace GPUs such as the RTX 4090. It keeps the entire pipeline in FP8, with a custom decode kernel and an FP8 KV cache, rather than silently falling back to FP16. In the project's benchmarks, Qwen3-8B uses 2.4x less peak VRAM than an FP16 baseline at a 20-25% throughput cost, reaching up to ~470 tokens/sec at batch size 16.
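
To make the mechanics concrete, here is a minimal CUDA sketch of the kind of dequantize-on-the-fly GEMV a runtime like this needs, since Ada has no native FP4 tensor cores. It assumes NVFP4's published layout (E2M1 nibbles in 16-element blocks, each with an FP8 E4M3 scale; the format's second-level per-tensor FP32 scale is omitted for brevity). This is not AdaLLM's actual kernel: the name gemv_nvfp4 and the low-nibble-first packing are hypothetical, and a real FP8 pipeline would keep activations in FP8 and lean on Ada's FP8 tensor cores rather than this scalar FP32 accumulation.

```cuda
#include <cuda_fp8.h>
#include <cstdint>

// The eight non-negative E2M1 magnitudes; bit 3 of a nibble is the sign.
__device__ __constant__ float kE2M1[8] =
    {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};

__device__ __forceinline__ float nvfp4_decode(uint8_t nibble) {
    float m = kE2M1[nibble & 0x7];
    return (nibble & 0x8) ? -m : m;
}

// y[row] = dot(W[row, :], x). W is stored as NVFP4: n/2 bytes of packed
// nibbles per row (low nibble first -- an assumed layout) plus n/16
// E4M3 scale factors per row. Launch with one 32-thread warp per row.
__global__ void gemv_nvfp4(const uint8_t* __restrict__ w_packed,
                           const __nv_fp8_e4m3* __restrict__ scales,
                           const float* __restrict__ x,
                           float* __restrict__ y,
                           int n /* columns, multiple of 16 */) {
    int row  = blockIdx.x;
    int lane = threadIdx.x;
    const uint8_t* wrow       = w_packed + (size_t)row * (n / 2);
    const __nv_fp8_e4m3* srow = scales   + (size_t)row * (n / 16);

    float acc = 0.0f;
    // Each lane owns whole 16-element blocks, so one scale covers its work.
    for (int blk = lane; blk < n / 16; blk += 32) {
        int base = blk * 16;
        float partial = 0.0f;
        for (int i = 0; i < 8; ++i) {          // 8 bytes = 16 nibbles
            uint8_t b = wrow[base / 2 + i];
            partial += nvfp4_decode(b & 0xF) * x[base + 2 * i];
            partial += nvfp4_decode(b >> 4)  * x[base + 2 * i + 1];
        }
        acc += float(srow[blk]) * partial;     // apply the per-block scale
    }
    // Reduce the warp's partial sums; lane 0 writes the row result.
    for (int off = 16; off > 0; off >>= 1)
        acc += __shfl_down_sync(0xffffffffu, acc, off);
    if (lane == 0) y[row] = acc;
}
```

Launched as gemv_nvfp4<<<rows, 32>>>(w, s, x, y, n), one warp per output row. The point of the pattern is that the weight stream stays 4-bit all the way to registers, which is where the peak-VRAM savings reported above come from; the dequant work rides along in an otherwise memory-bound kernel, consistent with a modest throughput cost.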

Why It Matters

This lets developers run larger, more advanced models locally on high-end consumer hardware, democratizing access to cutting-edge inference.