FP8 inference on Ampere without native hardware support | TinyLlama running on RTX 3050
New Triton kernels emulate FP8 precision on Ampere and Turing GPUs, delivering 1.5x faster inference without native FP8 hardware.
Feather, a new optimization framework, demonstrates that high-performance FP8 inference isn't exclusive to NVIDIA's latest H100 GPUs. By developing custom Triton kernels with bit-packing techniques, the project emulates FP8 precision in software, specifically targeting older architectures like Ampere, Turing, and Volta. The key innovation lies in optimizing for memory bandwidth, often the bottleneck on these GPUs, rather than relying on native hardware support. Initial results show TinyLlama-1.1B running 1.5x faster on a consumer-grade RTX 3050 than standard Hugging Face FP32 implementations, with only minimal accuracy degradation. The work has been accepted at PyTorch Conference Europe 2026, where it will be presented in Paris this April.
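The post doesn't reproduce Feather's kernels, but the core trick, storing e4m3 values as raw bytes and widening them to fp16 on the fly inside a Triton kernel, can be sketched roughly as follows. The function names, the per-tensor scale parameter, and the block size are illustrative assumptions, and subnormal/NaN e4m3 payloads are deliberately glossed over:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def _fp8_e4m3_dequant(packed_ptr, out_ptr, n_elements, scale, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    raw = tl.load(packed_ptr + offs, mask=mask, other=0).to(tl.uint16)
    # e4m3 byte layout: 1 sign | 4 exponent (bias 7) | 3 mantissa.
    # Widen to the fp16 layout (1 sign | 5 exponent, bias 15 | 10 mantissa):
    sign = (raw & 0x80) << 8            # sign moves from bit 7 to bit 15
    mag = (raw & 0x7F) << 7             # exponent+mantissa into fp16 positions
    # Rebias the exponent (+8 = 15 - 7) and keep +/-0 exact. Subnormal and
    # NaN e4m3 payloads are NOT handled correctly in this sketch; a real
    # kernel would special-case them.
    bits = tl.where(mag == 0, sign, sign | (mag + (8 << 10)))
    half = bits.to(tl.float16, bitcast=True)
    tl.store(out_ptr + offs, (half * scale).to(tl.float16), mask=mask)

def dequantize_fp8_e4m3(packed: torch.Tensor, scale: float) -> torch.Tensor:
    """Unpack uint8-stored e4m3 values to fp16 with one per-tensor scale."""
    assert packed.dtype == torch.uint8 and packed.is_cuda and packed.is_contiguous()
    out = torch.empty(packed.shape, dtype=torch.float16, device=packed.device)
    n = packed.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK"]),)
    _fp8_e4m3_dequant[grid](packed, out, n, scale, BLOCK=1024)
    return out
```

Because the weights live in memory as single bytes and are only widened inside the kernel, each weight costs one byte of bandwidth instead of two or four, which is where the speedup comes from on bandwidth-bound GPUs.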
The technical approach uses software emulation to pack FP8 data formats, making efficient use of available memory bandwidth on GPUs that lack dedicated FP8 units. While the developer describes the current kernels as "pretty naive," the roadmap includes substantial optimizations: integrating CUDA Graphs to reduce launch overhead, implementing block-level quantization schemes, and expanding support to the Llama-2 and Llama-3 model families. The developer openly seeks collaborators, particularly experts in CUDA Graphs and dynamic quantization, to help benchmark against established engines like vLLM. The work directly challenges the industry's rapid hardware-obsolescence cycle, offering a practical path to extend the usable life of existing data center and consumer GPUs for modern AI workloads.
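Block-level quantization itself is easy to sketch at the PyTorch level. Assuming a recent PyTorch that ships the float8_e4m3fn storage dtype (casting to it is a format conversion and needs no FP8 compute units), a per-block scheme could look like the following; the function names and the 128-element block size are illustrative, not Feather's actual scheme:

```python
import torch

E4M3_MAX = 448.0  # largest finite magnitude representable in float8_e4m3fn

def quantize_blockwise(w: torch.Tensor, block_size: int = 128):
    # One scale per contiguous block of weights, so an outlier only
    # distorts its own block rather than the whole tensor.
    flat = w.float().reshape(-1, block_size)      # assumes numel % block_size == 0
    scale = flat.abs().amax(dim=1, keepdim=True) / E4M3_MAX
    scale = scale.clamp_min(1e-12)                # guard against all-zero blocks
    q = (flat / scale).to(torch.float8_e4m3fn)    # storage cast; no FP8 ALUs needed
    return q, scale

def dequantize_blockwise(q: torch.Tensor, scale: torch.Tensor, shape) -> torch.Tensor:
    return (q.float() * scale).reshape(shape)

# Round-trip example: reconstruction error stays bounded per block.
w = torch.randn(256, 128)
q, s = quantize_blockwise(w)
w_hat = dequantize_blockwise(q, s, w.shape)
print((w - w_hat).abs().max())
```

Per-block scales are what keep the "minimal accuracy degradation" claim plausible: a single per-tensor scale lets one large weight crush the resolution of every other value, while a 128-element block confines that damage.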
- Software-emulated FP8 achieves 1.5x faster TinyLlama inference on RTX 3050 vs. FP32
- Uses custom Triton kernels with bit-packing to optimize memory bandwidth on Ampere/Turing GPUs
- Accepted at PyTorch Conference Europe 2026, with plans for CUDA Graphs (sketched below) and Llama family support
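Of those plans, CUDA Graphs are the most mechanical to picture: PyTorch can record a fixed kernel sequence once and replay it as a single launch, amortizing per-kernel launch overhead. A minimal sketch of the standard capture-and-replay pattern, with a Linear layer standing in for one decode step (none of this is Feather code):

```python
import torch

# A stand-in for one decode step; Feather would capture its own forward pass.
model = torch.nn.Linear(2048, 2048).cuda().half().eval()
static_in = torch.randn(1, 2048, device="cuda", dtype=torch.half)

# Warm up on a side stream so lazy initialization happens outside capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        static_out = model(static_in)
torch.cuda.current_stream().wait_stream(s)

# Capture the step into a graph; replays skip Python dispatch and
# per-kernel launch overhead entirely.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g), torch.no_grad():
    static_out = model(static_in)

# To run a new input, copy it into the static buffer and replay the graph.
static_in.copy_(torch.randn_like(static_in))
g.replay()  # static_out now holds the new result
```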
Why It Matters
Extends the lifespan of existing GPU fleets for cost-effective AI inference, reducing dependency on the latest hardware.