Open Source

llama-bench ROCm 7.2 on Strix Halo (Ryzen AI Max+ 395) — Qwen 3.5 Model Family

New benchmarks show AMD's Ryzen AI Max+ 395 with 128GB unified memory running massive AI models locally.

Deep Dive

New benchmarks reveal the raw local AI inference power of AMD's Strix Halo platform. Running on a Ryzen AI Max+ 395 APU with 128GB of unified memory and the ROCm 7.2 software stack, the system demonstrated it can handle massive language models that were previously restricted to data centers. The most impressive result shows the 122-billion-parameter Qwen3.5-122B model generating 21 tokens per second, fast enough to be potentially usable for interactive applications on what is essentially integrated graphics.
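To see why the 128GB unified memory pool is the enabling factor, a back-of-envelope calculation helps. The sketch below is illustrative only: the bits-per-weight figures are typical GGUF quantization levels, and the 10% overhead for embeddings and KV-cache headroom is an assumption, not a number from the benchmark post.

```python
# Rough estimate (not from the benchmarks) of whether a ~122B-parameter
# model fits in 128GB of unified memory at common quantization levels.

def quantized_size_gb(params_billion: float, bits_per_weight: float,
                      overhead: float = 1.10) -> float:
    """Approximate in-memory size of a quantized model.

    overhead is an assumed ~10% allowance for embeddings, norms,
    and KV-cache headroom, not a measured figure.
    """
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total * overhead / 1e9

for label, bits in [("Q4_K_M (~4.8 bpw)", 4.8), ("Q8_0 (~8.5 bpw)", 8.5)]:
    size = quantized_size_gb(122, bits)
    verdict = "fits" if size < 128 else "does NOT fit"
    print(f"{label}: ~{size:.0f} GB -> {verdict} in 128GB unified memory")
```

At roughly 4.8 bits per weight the model lands around 80GB, comfortably inside the unified pool, while an 8-bit quant would not fit; this is why the Unsloth quantized builds matter for these results.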

The benchmarks, conducted by Reddit user u/przbadu using the llama.cpp framework and Unsloth quantized models, highlight several key advancements. Mixture-of-Experts (MoE) models like the 35-billion-parameter Qwen3.5-35B-A3B showed exceptional efficiency, achieving 887 tokens/sec on prompt processing by activating only about 3 billion of those parameters per token. The tests also compared ROCm performance against Vulkan, providing a comprehensive look at the software ecosystem for AMD's AI hardware. This data points to a future where high-end consumer PCs can serve as capable local AI workstations.
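For readers unfamiliar with MoE, the minimal sketch below shows the routing mechanism that makes a "35B-A3B" model cheap to run: a learned router picks a few experts per token, so only a small slice of the total weights is touched. The expert count, dimensions, and ReLU feed-forward experts here are invented for illustration and do not reflect Qwen3.5's actual architecture.

```python
import numpy as np

# Toy MoE layer: route one token through only top_k of n_experts experts.
# All shapes and counts are made up for illustration.

rng = np.random.default_rng(0)
n_experts, top_k, d_model, d_ff = 64, 4, 512, 2048

router_w = rng.standard_normal((d_model, n_experts)) * 0.02
experts = [(rng.standard_normal((d_model, d_ff)) * 0.02,
            rng.standard_normal((d_ff, d_model)) * 0.02)
           for _ in range(n_experts)]

def moe_layer(x: np.ndarray) -> np.ndarray:
    logits = x @ router_w                 # router scores per expert
    chosen = np.argsort(logits)[-top_k:]  # keep only the top_k experts
    weights = np.exp(logits[chosen])
    weights /= weights.sum()              # softmax over the chosen experts
    out = np.zeros_like(x)
    for w, idx in zip(weights, chosen):
        w_in, w_out = experts[idx]
        out += w * (np.maximum(x @ w_in, 0) @ w_out)  # ReLU FFN expert
    return out

token = rng.standard_normal(d_model)
_ = moe_layer(token)
print(f"experts touched per token: {top_k}/{n_experts} "
      f"(~{top_k/n_experts:.0%} of expert parameters)")
```

Because prompt-processing cost scales with the active parameters rather than the full 35B, this routing is what drives the 887 tokens/sec figure.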

Key Points
  • AMD's Strix Halo APU ran a 122-billion-parameter Qwen3.5 model at 21 tokens/sec generation using integrated Radeon 8060S graphics (a quick latency sketch follows this list)
  • The 128GB unified memory architecture allowed massive models like GPT-OSS-120B (60GB quantized) to run entirely in memory without swapping
  • Mixture-of-Experts models showed major efficiency gains, with Qwen3.5-35B-A3B achieving 887 tokens/sec prompt processing by activating only 3B parameters per token
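To put the headline 21 tokens/sec in perspective, the quick calculation below converts it into words per second and reply latency. The words-per-token ratio and reading-speed figure are common rules of thumb, not measurements from the post.

```python
# What 21 tok/s means in practice for the 122B model.
gen_speed_tps = 21.0          # tokens/sec generation (from the benchmarks)
words_per_token = 0.75        # typical English tokenizer ratio (assumed)
reading_speed_wps = 250 / 60  # ~250 words/min silent reading (assumed)

words_per_sec = gen_speed_tps * words_per_token
print(f"~{words_per_sec:.1f} words/sec generated vs "
      f"~{reading_speed_wps:.1f} words/sec read")

reply_tokens = 300            # assumed chat reply length
print(f"a {reply_tokens}-token reply takes "
      f"~{reply_tokens / gen_speed_tps:.0f}s")
```

Generation at roughly 16 words per second is several times faster than typical reading speed, which is why 21 tokens/sec clears the bar for interactive use.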

Why It Matters

Enables running enterprise-scale AI models on consumer hardware, reducing cloud dependency and latency for developers and power users.