SmartAttentionDispatcher brings SageAttention speed to ComfyUI without restart

r/StableDiffusion May 11, 2026

ComfyUI node swaps PyTorch attention for SageAttention kernels, auto-detects GPU and model

Deep Dive

SmartAttentionDispatcher is a new ComfyUI node that replaces PyTorch's SDPA with SageAttention kernels (SA2 and SA3) to speed up attention computation during image generation. It runs as a plug-and-play patch after model loading and LoRA application, without requiring a ComfyUI restart or --use-sage-attention flags. The node automatically detects GPU architecture (e.g., RTX 50xx for SA3), installed libraries (sageattn, sageattn3), and model architecture, then selects the best kernel. Users see active mode, GPU tier, and kernel availability in the node status panel.
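The detection step described above can be pictured with a minimal sketch. This is not the node's actual code: the function names `library_available` and `pick_mode` are hypothetical, and the rule that Blackwell GPUs report a compute capability major version of 10 or higher is an assumption made for illustration.

```python
import importlib.util


def library_available(module_name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def pick_mode(cc_major: int, has_sa2: bool, has_sa3: bool) -> str:
    """Choose an attention mode from GPU tier and installed libraries.

    Assumption: Blackwell-class GPUs report a compute capability major
    version >= 10; everything else falls back to SA2 or plain SDPA.
    """
    if cc_major >= 10 and has_sa3:
        return "sa3"
    if has_sa2:
        return "sa2"
    # Neither SageAttention library is usable: keep PyTorch's SDPA.
    return "sdpa"


print(pick_mode(cc_major=12, has_sa2=True, has_sa3=True))
print(pick_mode(cc_major=8, has_sa2=True, has_sa3=False))
```

In a real environment, `cc_major` could come from `torch.cuda.get_device_capability()[0]`, and the two flags from `library_available(...)` called with whichever module names the installed SageAttention packages expose.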

The node offers four modes: standard SDPA (no change), SA2 (SageAttention2, with fp16, fp8, and triton kernels), SA3 (SageAttention3, for Blackwell GPUs with CUDA 12.8+), and a dynamic Combine mode that uses SA2 for the first and last diffusion steps and SA3 for the middle steps. It patches most DiT models (Flux, SD3.5, Z-Image, LTX, Wan) via transformer_options, and also scans sys.modules to patch models that import attention locally (Qwen, ErnieImage, ACE-Step). The tested models are compatible, with one exception: Qwen in SA3 mode produces unstable outputs for sequences over 7,000 tokens, while SA2 works correctly. SDXL is supported, but gains are minimal because its sequences are short.
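The Combine schedule can be sketched as a small per-step selector. The source does not say how many steps count as "first" and "last", so the `edge_steps` parameter and its default of 2 are hypothetical, as is the function name.

```python
def kernel_for_step(step: int, total_steps: int, edge_steps: int = 2) -> str:
    """Return which kernel family a Combine-style schedule would use.

    step is zero-based. edge_steps (hypothetical) is how many steps at the
    start and at the end of sampling are handled by SA2.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if not 0 <= step < total_steps:
        raise ValueError("step must be in [0, total_steps)")
    # First and last steps use SA2; the middle of the run uses SA3.
    if step < edge_steps or step >= total_steps - edge_steps:
        return "sa2"
    return "sa3"


schedule = [kernel_for_step(i, total_steps=8) for i in range(8)]
print(schedule)
# ['sa2', 'sa2', 'sa3', 'sa3', 'sa3', 'sa3', 'sa2', 'sa2']
```

With very short runs (for example 4 steps and `edge_steps=2`) every step falls in an edge region, so this sketch never switches to SA3.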

Key Points

Auto-detects GPU (Blackwell for SA3, others for SA2) and selects best kernel (fp16, fp8, fp8++, triton) without user config
Three attention modes: SA2 for speed, SA3 for Blackwell, and dynamic Combine that switches per diffusion step for optimal quality
Supports models with local attention imports (Qwen, ErnieImage) via sys.modules patching; Qwen SA3 unstable above 7k sequence length
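The sys.modules approach is needed because `from x import attention` binds a private name inside the importing module, so replacing the function at its original location leaves that copy untouched. A minimal sketch of the idea follows, using a stand-in module; `patch_local_imports`, `reference_attention`, `fast_attention`, and `demo_model` are all hypothetical names, and the node's real patching logic may differ.

```python
import sys
import types


def patch_local_imports(original, replacement, name_prefix=""):
    """Rebind every module-level name that points at `original`.

    Only modules whose name starts with name_prefix are scanned.
    Returns the list of "module.attribute" names that were rebound.
    """
    patched = []
    for mod_name, module in list(sys.modules.items()):
        namespace = getattr(module, "__dict__", None)
        if namespace is None or not mod_name.startswith(name_prefix):
            continue
        for attr, value in list(namespace.items()):
            if value is original:
                setattr(module, attr, replacement)
                patched.append(f"{mod_name}.{attr}")
    return patched


def reference_attention(q, k, v):
    return "reference"


def fast_attention(q, k, v):
    return "fast"


# Stand-in for a model file that did `from somewhere import attention`.
demo = types.ModuleType("demo_model")
demo.attention = reference_attention
sys.modules["demo_model"] = demo

patched = patch_local_imports(reference_attention, fast_attention, "demo_model")
print(patched)
print(demo.attention(None, None, None))
```

Matching by identity (`is`) rather than by attribute name means aliases such as `from x import attention as attn` are also caught, at the cost of scanning every module namespace under the prefix.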

Why It Matters

ComfyUI users get faster image generation from better-matched attention kernels, with no manual tuning required.
