Image & Video

[ComfyUI] Accelerate Z-Image (S3-DiT) by 20-30% & save 3.5GB VRAM using Triton+INT8 (No extra model downloads)

Open-source optimization runs on existing BF16 models with no extra downloads, using Triton kernels and INT8 quantization.

Deep Dive

Independent developer newgrit1004 has released an open-source optimization for ComfyUI that significantly boosts performance for the Z-Image S3-DiT model. The custom node implements Triton kernel fusion combined with W8A8 INT8 quantization, delivering 20-30% faster inference speeds while reducing VRAM usage by approximately 3.5GB. Crucially, it operates directly on users' existing BF16 model files, eliminating the need to download separate quantized versions. Benchmarks on an RTX 5090 show text-to-image generation times dropping from 18.9 seconds to 15.3 seconds, with even greater improvements when using LoRAs.
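To make the W8A8 idea concrete, here is a minimal sketch (in plain PyTorch, not the node's actual code) of symmetric INT8 quantization applied on the fly to a BF16 linear layer: 8-bit weights and 8-bit activations, with dequantization back to BF16 after the matmul. The per-output-channel weight scales, per-tensor dynamic activation scale, and the int32 reference matmul are assumptions for illustration; the actual node's scheme and fused kernels may differ.

```python
import torch

def quantize_int8(t: torch.Tensor, dim=None):
    # Symmetric quantization: map the largest |value| to 127.
    if dim is None:
        amax = t.abs().max().float().clamp(min=1e-8)                          # per-tensor scale
    else:
        amax = t.abs().amax(dim=dim, keepdim=True).float().clamp(min=1e-8)    # per-channel scales
    scale = amax / 127.0
    q = torch.clamp(torch.round(t.float() / scale), -127, 127).to(torch.int8)
    return q, scale

def int8_linear(x_bf16: torch.Tensor, w_bf16: torch.Tensor) -> torch.Tensor:
    # x_bf16: (tokens, in_features) activations; w_bf16: (out_features, in_features)
    # taken directly from the existing BF16 checkpoint -- no separate quantized file.
    w_q, w_scale = quantize_int8(w_bf16, dim=1)   # W8: per-output-channel weight scales
    x_q, x_scale = quantize_int8(x_bf16)          # A8: per-tensor dynamic activation scale
    # Reference matmul with int32 accumulation (CPU-friendly); real kernels use hardware INT8 paths.
    acc = x_q.to(torch.int32) @ w_q.to(torch.int32).t()
    # Dequantize: multiply by both scales and return to BF16.
    return (acc.float() * x_scale * w_scale.t()).to(torch.bfloat16)

# Usage: y = int8_linear(activations, linear.weight) stands in for y = activations @ linear.weight.t()
```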

The optimization includes six fused Triton kernels covering key operations like RMSNorm, SwiGLU, and RoPE, along with Hadamard rotation techniques from recent research (QuaRot, NeurIPS 2024) to maintain image quality despite quantization. Developer testing shows only minor pixel-level differences, with overall composition and fine detail preserved. The node serves as a drop-in replacement in existing workflows, maintaining full compatibility with LoRAs and ControlNets, and is easily installable through ComfyUI Manager. The developer also announced upcoming work on similar optimizations for Qwen3-TTS, promising ~5x speedups for AI audio generation pipelines.
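To illustrate the kernel-fusion side, below is a minimal Triton RMSNorm kernel of the kind described: each program instance normalizes one row in a single fused pass, reading BF16 and accumulating in float32. The kernel name, block sizing, and launch wrapper are assumptions for this sketch, not the node's actual kernels (which also cover SwiGLU, RoPE, and the Hadamard rotations).

```python
import torch
import triton
import triton.language as tl

@triton.jit
def rmsnorm_kernel(x_ptr, w_ptr, out_ptr, n_cols, eps, BLOCK_SIZE: tl.constexpr):
    # One program instance handles one row; the whole row fits in one block.
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    x = tl.load(x_ptr + row * n_cols + cols, mask=mask, other=0.0).to(tl.float32)
    # RMSNorm: x / sqrt(mean(x^2) + eps) * weight, computed in a single fused pass.
    rms = tl.sqrt(tl.sum(x * x, axis=0) / n_cols + eps)
    w = tl.load(w_ptr + cols, mask=mask, other=0.0).to(tl.float32)
    y = (x / rms) * w
    tl.store(out_ptr + row * n_cols + cols, y.to(out_ptr.dtype.element_ty), mask=mask)

def rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # x: (rows, n_cols) contiguous BF16 tensor on the GPU; weight: (n_cols,) gain vector.
    x = x.contiguous()
    rows, n_cols = x.shape
    out = torch.empty_like(x)
    BLOCK_SIZE = triton.next_power_of_2(n_cols)
    rmsnorm_kernel[(rows,)](x, weight, out, n_cols, eps, BLOCK_SIZE=BLOCK_SIZE)
    return out
```

Fusing the normalization, scaling, and dtype conversion into one kernel avoids the extra global-memory round trips that separate PyTorch ops would incur, which is where the latency savings come from.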

Key Points
  • Achieves 20-30% faster inference for Z-Image S3-DiT (6.15B) using Triton kernel fusion + INT8 quantization
  • Saves ~3.5GB VRAM (from 23GB to 19.5GB) while working with existing BF16 models
  • Drop-in compatible with LoRAs/ControlNets, available via ComfyUI Manager with no custom CUDA builds required

Why It Matters

Makes high-quality image generation more accessible by reducing hardware requirements and speeding up creative workflows.