Open Source

(Llama.cpp) In case people are struggling with prompt processing on larger models like Qwen 27B, here's what helped me out

A simple tweak to the `--ubatch-size` parameter in llama.cpp made prompt processing roughly seven times faster for a 27B model.

Deep Dive

A developer has shared a notable optimization for running large language models locally with the popular llama.cpp framework. While experimenting with the Qwen3.5-27B model on an AMD Radeon RX 9070 XT GPU, they struggled with slow prompt processing. Through trial and error, they found that the key was the `--ubatch-size` parameter, which sets the physical micro-batch size: the number of tokens pushed through the GPU in a single pass during prompt processing. The default value of 512 was causing a major bottleneck on this setup.

By setting `--ubatch-size` to 64, a value that happens to match the card's 64 MB L3 cache, they saw a dramatic performance leap. Benchmark results showed prompt processing speed (pp512) climbing from 83 tokens/second at ubatch-size 8 to 582 tokens/second at ubatch-size 64, roughly a sevenfold improvement, and far ahead of the default setting. This tweak made the 27-billion-parameter model usable for interactive tasks like code generation. The finding highlights a crucial, often-overlooked setting that can make or break the local inference experience for developers and enthusiasts running state-of-the-art models on consumer hardware.
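A quick way to check whether the same effect shows up on other hardware is to sweep the micro-batch size with llama-bench, the benchmarking tool that ships with llama.cpp. The command below is only a sketch: the model path is a placeholder, `-ngl 99` offloads all layers to the GPU, `-n 0` skips the token-generation test so only prompt processing (the pp512 column) is measured, and `-ub` takes a comma-separated list of micro-batch sizes to compare.

```
# Sketch: compare prompt-processing throughput (pp512) across micro-batch sizes.
# The model path is a placeholder; adjust paths and values for your own files and GPU.
./llama-bench -m ./models/qwen3.5-27b-q4_k_m.gguf -ngl 99 -p 512 -n 0 -ub 8,64,512
```

Each `-ub` value produces its own row in the output table, so the 8-versus-64 comparison reported above can be reproduced directly from the pp512 column.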

Key Points
  • Setting `--ubatch-size` to 64, a value matching the GPU's 64 MB L3 cache, boosted Qwen3.5-27B prompt processing (pp512) from ~83 t/s to ~582 t/s (an example command is sketched after this list).
  • The default `--ubatch-size` of 512 caused severe performance issues, making the 27B model nearly unusable for interactive tasks.
  • The optimization was found on Windows 11 using the llama.cpp ROCm backend with AMD's latest 26.2.2 drivers.
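
For day-to-day interactive use, the same flag can be passed to llama-server (or llama-cli). The invocation below is a minimal sketch assuming a local GGUF file, full GPU offload, and an 8K context; the model path and values are placeholders to adapt to your own setup.

```
# Sketch: apply the reported setting when serving the model interactively.
# --ubatch-size (alias -ub) sets the physical micro-batch; the model path is a placeholder.
./llama-server -m ./models/qwen3.5-27b-q4_k_m.gguf -ngl 99 -c 8192 --ubatch-size 64
```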

Why It Matters

This simple tweak unlocks practical local use of large 27B+ models, making advanced AI more accessible to developers without enterprise hardware.