Open Source

hipEngine: AMD RDNA3 LLM Engine Outperforms llama.cpp on Qwen 3.6

r/LocalLLaMA May 25, 2026

⚡ Open-source ROCm-native engine achieves 2718 tok/s prefill on 7900 XTX.

Deep Dive

A developer has released hipEngine, an open-source (AGPLv3), ROCm-native inference engine built for AMD RDNA3 GPUs. It targets local LLM inference, pairing a Python frontend with HIP/C++ kernels and AMD libraries such as hipBLASLt and AOTriton. The initial release focuses on Qwen 3.6 (both MoE and dense variants) and already delivers performance that is competitive with, and often better than, the industry-standard llama.cpp.

Benchmarks on a Radeon RX 7900 XTX (gfx1100) show hipEngine leading in prefill throughput at every tested context length from 512 to 128K. At 512/128 context it reached 2718 tok/s, ahead of llama.cpp's HIP backend (2436 tok/s) by roughly 11.6%. Decode speeds are similar (~103 tok/s at 512/128), so the larger difference is memory efficiency: hipEngine uses 22.1 GiB at 128K context with a BF16 KV cache, and near-lossless INT8 compression drops that to under 20 GiB, allowing the full Qwen 3.6 256K context window on a single 24 GB GPU. On the integrated Strix Halo GPU (Radeon 8060S), hipEngine also beats llama.cpp HIP on prefill and matches or exceeds it on decode.
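The prefill gap can be checked directly from the two reported throughput figures. The short sketch below only does that arithmetic; the function name `relative_speedup` is illustrative and is not part of hipEngine or llama.cpp.

```python
def relative_speedup(new_rate: float, baseline_rate: float) -> float:
    """Return how much faster new_rate is than baseline_rate, in percent."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (new_rate / baseline_rate - 1.0) * 100.0


# Prefill throughput figures quoted in the article (tokens per second).
hipengine_prefill = 2718.0
llamacpp_hip_prefill = 2436.0

speedup_pct = relative_speedup(hipengine_prefill, llamacpp_hip_prefill)
print(f"hipEngine prefill is {speedup_pct:.1f}% faster")
```

With these inputs the result rounds to 11.6%, which is consistent with the "over 10%" figure in the original post.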

Key features include ParoQuant quantization at 4.68 bpw (bits per weight) for improved accuracy, initial GGUF support for broader model compatibility, and an optimized memory allocator that retains only essential KV cache. This makes hipEngine a compelling choice for AMD users who want high-performance, memory-efficient local LLM inference without relying on CUDA-dependent backends.
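The article does not describe how hipEngine's INT8 KV cache compression works internally. As a rough illustration of the general idea only, the sketch below uses plain symmetric absmax quantization on a single vector: each value is stored as one signed byte plus a shared scale, instead of two bytes per value for BF16. The function names and the sample vector are hypothetical, and the actual engine may use a different scheme (for example, different grouping or scale handling).

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric absmax quantization: map floats to int8 codes plus one scale."""
    max_abs = max((abs(v) for v in values), default=0.0)
    # Guard against an all-zero vector, where any scale works.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes: List[int], scale: float) -> List[float]:
    """Recover approximate floats from int8 codes and their shared scale."""
    return [c * scale for c in codes]


# Hypothetical slice of a cached key vector (illustrative values only).
key_vector = [0.12, -0.53, 0.98, -1.27, 0.04, 0.66]

codes, scale = quantize_int8(key_vector)
restored = dequantize_int8(codes, scale)

# Rounding error per element is at most half of one quantization step.
max_error = max(abs(a - b) for a, b in zip(key_vector, restored))
print(codes, scale, max_error)
```

Because each cached element shrinks from two bytes to one (plus a small overhead for scales), a scheme along these lines roughly halves KV cache size, which is the kind of saving that lets a longer context fit in the same memory budget.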

Key Points

Achieves up to 2718 tok/s prefill on 7900 XTX (512 context), more than 10% faster than llama.cpp HIP.
Near-lossless INT8 KV cache enables full 256K context in <24 GB, with peak memory as low as 19.8 GiB.
Supports Qwen 3.6 MoE/dense, ParoQuant quantization, GGUF, and runs on RDNA3 (gfx1100, gfx1151).

Why It Matters

Gives AMD GPU users a high-performance, open-source LLM inference engine competitive with CUDA-based solutions.
