Open Source

(Llama.cpp) In case people are struggling with prompt processing on larger models like Qwen 27B, here's what helped me out

A simple tweak to the `--ubatch-size` parameter in llama.cpp made prompt processing roughly seven times faster for a 27B model.

Deep Dive

A developer has shared a notable optimization for running large language models locally with the popular llama.cpp framework. While experimenting with the Qwen3.5-27B model on an AMD Radeon RX 9070 XT GPU, they struggled with slow prompt processing. Through trial and error, they found that the key was the `--ubatch-size` parameter, which sets the physical micro-batch size: the number of tokens pushed through the GPU in a single pass during prompt processing. The default value of 512 was causing a major bottleneck on this setup.

By setting `--ubatch-size` to 64, a value that happens to match the card's 64 MB L3 cache, they saw a dramatic performance leap. Benchmark results showed prompt processing speed (pp512) climbing from 83 tokens/second at ubatch-size 8 to 582 tokens/second at ubatch-size 64, roughly a sevenfold improvement, and far ahead of the default setting. This tweak made the 27-billion-parameter model usable for interactive tasks like code generation. The finding highlights a crucial, often-overlooked setting that can make or break the local inference experience for developers and enthusiasts running state-of-the-art models on consumer hardware.
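A quick way to check whether the same effect shows up on other hardware is to sweep the micro-batch size with llama-bench, the benchmarking tool that ships with llama.cpp. The command below is only a sketch: the model path is a placeholder, `-ngl 99` offloads all layers to the GPU, `-n 0` skips the token-generation test so only prompt processing (the pp512 column) is measured, and `-ub` takes a comma-separated list of micro-batch sizes to compare.

```
# Sketch: compare prompt-processing throughput (pp512) across micro-batch sizes.
# The model path is a placeholder; adjust paths and values for your own files and GPU.
./llama-bench -m ./models/qwen3.5-27b-q4_k_m.gguf -ngl 99 -p 512 -n 0 -ub 8,64,512
```

Each `-ub` value produces its own row in the output table, so the 8-versus-64 comparison reported above can be reproduced directly from the pp512 column.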

Key Points
  • Setting `--ubatch-size` to 64, a value matching the GPU's 64 MB L3 cache, boosted Qwen3.5-27B prompt processing (pp512) from ~83 t/s to ~582 t/s (an example command is sketched after this list).
  • The default `--ubatch-size` of 512 caused severe performance issues, making the 27B model nearly unusable for interactive tasks.
  • The optimization was found on Windows 11 using the llama.cpp ROCm backend with AMD's latest 26.2.2 drivers.
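
For day-to-day interactive use, the same flag can be passed to llama-server (or llama-cli). The invocation below is a minimal sketch assuming a local GGUF file, full GPU offload, and an 8K context; the model path and values are placeholders to adapt to your own setup.

```
# Sketch: apply the reported setting when serving the model interactively.
# --ubatch-size (alias -ub) sets the physical micro-batch; the model path is a placeholder.
./llama-server -m ./models/qwen3.5-27b-q4_k_m.gguf -ngl 99 -c 8192 --ubatch-size 64
```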

Why It Matters

This simple tweak unlocks practical local use of large 27B+ models, making advanced AI more accessible to developers without enterprise hardware.