Qwen Introduces FlashQLA
New linear attention kernels deliver 2–3× speedups for agentic AI on personal devices.
Qwen has released FlashQLA, a set of high-performance linear attention kernels built on TileLang that achieve 2–3× forward and 2× backward speedups. The kernels are purpose-built for agentic AI on personal devices, using gate-driven automatic intra-card compute parallelism (CP) and hardware-friendly algebraic reformulation to raise streaming multiprocessor (SM) utilization. Rather than fusing everything into one kernel, FlashQLA splits the Gated DeltaNet (GDN) flow into two optimized kernels; this incurs extra memory I/O at large batch sizes but delivers better real-world performance on edge devices and for long-context workloads.
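To make the two-kernel split concrete, here is a minimal PyTorch sketch of chunked gated linear attention. It is a simplification with plain decay gates rather than the full GDN delta-rule update, it is not Qwen's code, and the function name, shapes, and chunk size are illustrative only: pass 1 is intra-chunk work that is independent across chunks, and pass 2 carries the small recurrent state between chunks.

```python
import torch

def chunked_gated_linear_attention(q, k, v, g, chunk=64):
    """q, k, v: (T, d) tensors; g: (T,) per-step decay gates in (0, 1)."""
    T, d = q.shape
    assert T % chunk == 0, "illustrative sketch: T must be a multiple of the chunk size"
    q, k, v, g = (x.reshape(T // chunk, chunk, -1) for x in (q, k, v, g.unsqueeze(-1)))

    # Pass 1 ("kernel" 1): purely intra-chunk work, independent across chunks,
    # so every chunk can be processed in parallel.
    log_decay = torch.cumsum(torch.log(g), dim=1)                # log of cumulative gate product
    decay = torch.exp(log_decay - log_decay.transpose(1, 2))     # pairwise decay between positions
    mask = torch.tril(torch.ones(chunk, chunk, dtype=torch.bool))
    attn = (q @ k.transpose(1, 2)) * torch.where(mask, decay, 0.0)
    o_intra = attn @ v

    # Pass 2 ("kernel" 2): sequential inter-chunk recurrence over the small state S,
    # adding each chunk's contribution from everything before it.
    S = torch.zeros(d, v.shape[-1], dtype=q.dtype)
    o = torch.empty_like(o_intra)
    for i in range(q.shape[0]):
        pos_decay = torch.exp(log_decay[i])                      # decay from chunk start to each position
        o[i] = o_intra[i] + (q[i] * pos_decay) @ S
        chunk_decay = torch.exp(log_decay[i, -1])                # decay across the whole chunk
        S = chunk_decay * S + (k[i] * (chunk_decay / pos_decay)).transpose(0, 1) @ v[i]
    return o.reshape(T, -1)
```

In this simplified view, pass 1 is embarrassingly parallel across chunks while pass 2 only touches the small state matrix; the actual FlashQLA kernels layer further algebraic reformulation and gate-driven scheduling on top of this basic split.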
Gains are especially pronounced for TP (tensor parallelism) setups, small models, and long-context tasks. The backward pass was the most challenging component, requiring a 16-stage warp-specialized pipeline under tight on-chip memory constraints to achieve 2×+ kernel-level speedups. The project is open-source, with code available on GitHub and detailed technical insights in Qwen's blog post. This development aims to make advanced AI inference more efficient on consumer hardware.
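Warp specialization can be pictured as producer warps that move data and consumer warps that compute, handing tiles through a bounded on-chip buffer. The plain-Python sketch below is only a conceptual analogy under that assumption; the real backward kernel is warp-level TileLang code with a far deeper 16-stage schedule, and every name here is made up for illustration.

```python
import queue
import threading

def run_pipeline(num_tiles=64, stages=16):
    """Toy producer/consumer pipeline; `stages` stands in for the on-chip buffer budget."""
    buf = queue.Queue(maxsize=stages)         # at most `stages` tiles in flight at once
    results = []

    def producer():                           # analogue of producer warps: issue data movement
        for i in range(num_tiles):
            buf.put(list(range(i, i + 8)))    # pretend this copies one tile of activations/grads
        buf.put(None)                         # sentinel: nothing left to load

    def consumer():                           # analogue of consumer warps: compute as tiles arrive
        while (tile := buf.get()) is not None:
            results.append(sum(x * x for x in tile))

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results)
```

Raising `stages` lets the producer run further ahead of the consumer, but only if the buffer (shared memory and registers in the real kernel) can hold that many tiles, which is the on-chip memory constraint the paragraph above refers to.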
- FlashQLA delivers 2–3× forward and 2× backward speedups for linear attention.
- Uses gate-driven intra-card CP and fused, warp-specialized TileLang kernels.
- Best for small models, long-context workloads, and edge devices with TP setups.
Why It Matters
Enables efficient agentic AI on personal devices, reducing latency and hardware requirements.