Qwen Introduces FlashQLA
New linear attention kernels deliver 2–3× speedups for agentic AI on personal devices.
Qwen has released FlashQLA, a set of high-performance linear attention kernels built on TileLang that achieve 2–3× forward and 2× backward speedups. The kernels are purpose-built for agentic AI on personal devices, using gate-driven automatic intra-card compute parallelism (CP) and hardware-friendly algebraic reformulation to raise streaming multiprocessor (SM) utilization. Rather than fusing everything into one kernel, FlashQLA splits the Gated DeltaNet (GDN) flow into two optimized kernels; this incurs extra memory I/O at large batch sizes but delivers better real-world performance on edge devices and for long-context workloads.
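To make the two-kernel split concrete, here is a minimal PyTorch sketch of chunked gated linear attention. It is a simplification with plain decay gates rather than the full GDN delta-rule update, it is not Qwen's code, and the function name, shapes, and chunk size are illustrative only: pass 1 is intra-chunk work that is independent across chunks, and pass 2 carries the small recurrent state between chunks.

```python
import torch

def chunked_gated_linear_attention(q, k, v, g, chunk=64):
    """q, k, v: (T, d) tensors; g: (T,) per-step decay gates in (0, 1)."""
    T, d = q.shape
    assert T % chunk == 0, "illustrative sketch: T must be a multiple of the chunk size"
    q, k, v, g = (x.reshape(T // chunk, chunk, -1) for x in (q, k, v, g.unsqueeze(-1)))

    # Pass 1 ("kernel" 1): purely intra-chunk work, independent across chunks,
    # so every chunk can be processed in parallel.
    log_decay = torch.cumsum(torch.log(g), dim=1)                # log of cumulative gate product
    decay = torch.exp(log_decay - log_decay.transpose(1, 2))     # pairwise decay between positions
    mask = torch.tril(torch.ones(chunk, chunk, dtype=torch.bool))
    attn = (q @ k.transpose(1, 2)) * torch.where(mask, decay, 0.0)
    o_intra = attn @ v

    # Pass 2 ("kernel" 2): sequential inter-chunk recurrence over the small state S,
    # adding each chunk's contribution from everything before it.
    S = torch.zeros(d, v.shape[-1], dtype=q.dtype)
    o = torch.empty_like(o_intra)
    for i in range(q.shape[0]):
        pos_decay = torch.exp(log_decay[i])                      # decay from chunk start to each position
        o[i] = o_intra[i] + (q[i] * pos_decay) @ S
        chunk_decay = torch.exp(log_decay[i, -1])                # decay across the whole chunk
        S = chunk_decay * S + (k[i] * (chunk_decay / pos_decay)).transpose(0, 1) @ v[i]
    return o.reshape(T, -1)
```

In this simplified view, pass 1 is embarrassingly parallel across chunks while pass 2 only touches the small state matrix; the actual FlashQLA kernels layer further algebraic reformulation and gate-driven scheduling on top of this basic split.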
Gains are especially pronounced for TP (tensor parallelism) setups, small models, and long-context tasks. The backward pass was the most challenging component, requiring a 16-stage warp-specialized pipeline under tight on-chip memory constraints to achieve 2×+ kernel-level speedups. The project is open-source, with code available on GitHub and detailed technical insights in Qwen's blog post. This development aims to make advanced AI inference more efficient on consumer hardware.
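Warp specialization can be pictured as producer warps that move data and consumer warps that compute, handing tiles through a bounded on-chip buffer. The plain-Python sketch below is only a conceptual analogy under that assumption; the real backward kernel is warp-level TileLang code with a far deeper 16-stage schedule, and every name here is made up for illustration.

```python
import queue
import threading

def run_pipeline(num_tiles=64, stages=16):
    """Toy producer/consumer pipeline; `stages` stands in for the on-chip buffer budget."""
    buf = queue.Queue(maxsize=stages)         # at most `stages` tiles in flight at once
    results = []

    def producer():                           # analogue of producer warps: issue data movement
        for i in range(num_tiles):
            buf.put(list(range(i, i + 8)))    # pretend this copies one tile of activations/grads
        buf.put(None)                         # sentinel: nothing left to load

    def consumer():                           # analogue of consumer warps: compute as tiles arrive
        while (tile := buf.get()) is not None:
            results.append(sum(x * x for x in tile))

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results)
```

Raising `stages` lets the producer run further ahead of the consumer, but only if the buffer (shared memory and registers in the real kernel) can hold that many tiles, which is the on-chip memory constraint the paragraph above refers to.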
- FlashQLA delivers 2–3× forward and 2× backward speedups for linear attention.
- Uses gate-driven intra-card CP and fused, warp-specialized TileLang kernels.
- Best for small models, long-context workloads, and edge devices with TP setups.
Why It Matters
Enables efficient agentic AI on personal devices, reducing latency and hardware requirements.