DiTs' massive activations reveal sparse channels control image semantics
A few hidden channels in Diffusion Transformers carry a disproportionate share of the semantic load...
A new paper by Evelyn Turri, Davide Bucciarelli, Sara Sarto, Lorenzo Baraldi, and Marcella Cornia (University of Modena and Reggio Emilia) reports that Diffusion Transformers (DiTs), the backbone of modern text-to-image models like Stable Diffusion 3, rely on only a few hidden-state channels to control image semantics. These 'massive activations' are channels whose responses are consistently much larger than the rest. The authors establish three properties. First, despite their sparsity, massive activations are functionally critical: a controlled disruption probe that zeros these channels causes a sharp collapse in generation quality, while disrupting an equally sized set of low-statistic channels has only a marginal effect.
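As a rough illustration of the disruption probe described above, the sketch below runs on a toy matrix of hidden states rather than a real DiT. Several choices here are assumptions and not taken from the paper: the per-channel statistic (mean absolute activation over tokens), the top-k selection rule, the planted channel indices, and the use of relative change in the hidden state as a stand-in for generation quality.

```python
import numpy as np

def channel_statistic(hidden):
    """Mean absolute activation per channel; hidden has shape (tokens, channels)."""
    return np.abs(hidden).mean(axis=0)

def select_channels(hidden, k, largest=True):
    """Indices of the k channels with the largest (or smallest) statistic."""
    order = np.argsort(channel_statistic(hidden))  # ascending order
    return order[-k:] if largest else order[:k]

def zero_channels(hidden, idx):
    """Return a copy of hidden with the given channels set to zero."""
    out = hidden.copy()
    out[:, idx] = 0.0
    return out

def relative_change(before, after):
    """Norm of the perturbation relative to the original hidden state."""
    return np.linalg.norm(before - after) / np.linalg.norm(before)

# Toy hidden states: 16 tokens, 64 channels, with three planted massive channels.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(16, 64))
planted = [3, 17, 42]  # hypothetical indices, for illustration only
hidden[:, planted] *= 100.0

massive = select_channels(hidden, k=3, largest=True)
low = select_channels(hidden, k=3, largest=False)

change_massive = relative_change(hidden, zero_channels(hidden, massive))
change_low = relative_change(hidden, zero_channels(hidden, low))
print(sorted(massive.tolist()), change_massive, change_low)
```

In this toy setting, zeroing the three massive channels removes almost all of the hidden state's energy, while zeroing three low-statistic channels barely changes it. A real probe would instead apply the zeroing inside the model's forward pass and compare the generated images.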
Second, massive activations are spatially organized. Restricting image-stream tokens to these channels and clustering them yields coherent partitions that closely align with the main subject and salient regions, exposing a structured spatial code hidden inside an outlier-like subspace. Third, they are transferable: transporting massive activations from one prompt-conditioned trajectory into another shifts the final image toward the source prompt while preserving substantial content from the target, producing localized semantic interpolation rather than unstructured pixel blending. The authors demonstrate two use cases, text-conditioned and image-conditioned semantic transport, which enable prompt interpolation and subject-driven generation without any training. Taken together, the results reinterpret massive activations not as anomalies but as a sparse, prompt-conditioned carrier subspace that organizes semantic information in DiTs.
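The transport idea can be sketched in the same toy setting: copy (or blend) the massive channels of a source hidden state into a target hidden state and leave every other channel alone. This is a minimal sketch under stated assumptions, not the authors' procedure. The function name `transport_massive`, the blend weight `alpha`, and the channel indices are all hypothetical, and the summary does not say at which layers or denoising steps the transport is applied.

```python
import numpy as np

def transport_massive(target, source, channels, alpha=1.0):
    """Blend the chosen channels of `source` into a copy of `target`.

    target, source: arrays of shape (tokens, channels), assumed to come
        from the same layer and denoising step of two trajectories.
    channels: indices of the massive channels (assumed known in advance).
    alpha: hypothetical blend weight; 1.0 overwrites the target channels,
        0.0 leaves the target unchanged.
    """
    if target.shape != source.shape:
        raise ValueError("target and source must have the same shape")
    out = target.copy()
    out[:, channels] = (
        (1.0 - alpha) * target[:, channels] + alpha * source[:, channels]
    )
    return out

# Toy hidden states standing in for two prompt-conditioned trajectories.
rng = np.random.default_rng(0)
target = rng.normal(size=(16, 64))
source = rng.normal(size=(16, 64))
channels = np.array([3, 17, 42])  # hypothetical massive-channel indices

swapped = transport_massive(target, source, channels, alpha=1.0)
halfway = transport_massive(target, source, channels, alpha=0.5)
```

Because only a handful of channels are touched, the remaining channels of the target pass through unchanged, which is consistent with the summary's description of the target's content being largely preserved.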
- Massive activations are a small subset of DiT channels with responses consistently much larger than the rest.
- Zeroing massive channels collapses generation quality; zeroing an equally sized set of low-statistic channels has only a marginal effect.
- Massive activations can be transferred between prompts to enable semantic interpolation and subject-driven generation without additional training.
Why It Matters
The findings enable training-free semantic control in DiTs, which could simplify prompt engineering and subject-driven generation in text-to-image models.