TaTok: Adaptive image tokenization speeds up inference 8.7x, improves gFID 1.3x
New method uses global tokens and entropy-based filtering to eliminate redundancy in image tokenization.
Researchers from the Chinese Academy of Sciences have introduced TaTok, a theoretically grounded adaptive image tokenization framework that addresses two fundamental flaws in existing methods: information insufficiency when using only patch tokens and information redundancy among those patches. By modeling mutual information across patches with global tokens and applying a Dynamic Token Filtering (DTF) algorithm based on cumulative conditional entropy, TaTok eliminates redundant tokens while preserving critical information. The approach is inspired by information entropy theory, allowing the system to allocate more tokens to information-rich regions and fewer to homogeneous areas.
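To make the filtering idea concrete, here is a minimal sketch of entropy-budgeted token selection. It is not TaTok's actual DTF algorithm, which the summary above does not specify in detail. The function name `filter_tokens`, the `keep_fraction` parameter, and the assumption that each token already has a fixed conditional-entropy estimate (in bits) are all hypothetical and used only for illustration.

```python
from typing import List, Sequence


def filter_tokens(cond_entropy: Sequence[float], keep_fraction: float = 0.9) -> List[int]:
    """Return the indices of the tokens to keep (illustrative sketch only).

    cond_entropy[i] is an assumed, precomputed estimate of how much new
    information token i carries, in bits. A real system would re-estimate
    this as tokens are selected; here the estimates are fixed for simplicity.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")

    total = sum(cond_entropy)
    if total <= 0.0:
        return []  # nothing informative to keep

    # Visit tokens from most to least informative.
    order = sorted(range(len(cond_entropy)), key=lambda i: cond_entropy[i], reverse=True)

    kept: List[int] = []
    cumulative = 0.0
    for i in order:
        kept.append(i)
        cumulative += cond_entropy[i]
        # Stop once the kept tokens cover the requested share of total entropy.
        if cumulative >= keep_fraction * total:
            break
    return sorted(kept)


# A mostly homogeneous image: one informative token, the rest near-redundant.
flat = [2.0, 0.1, 0.05, 0.05, 0.1, 0.2]
# A detailed image: every token carries a similar amount of information.
detailed = [2.0, 1.5, 1.8, 2.2, 1.9, 2.1]

print(filter_tokens(flat))      # keeps 3 of 6 tokens
print(filter_tokens(detailed))  # keeps all 6 tokens
```

With the same 90% entropy budget, the homogeneous example keeps half of its tokens while the detailed one keeps all of them, which is the adaptive behavior the article describes: more tokens for information-rich content, fewer for uniform regions.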
Experimental results show state-of-the-art performance for TaTok: a 1.3x improvement in gFID (generation Fréchet Inception Distance, an image-quality metric where lower is better) and an 8.7x inference speedup compared to fixed-rate tokenizers. The framework is particularly relevant for processing long image sequences, where efficient tokenization is crucial. By adapting to the variable information density of images, TaTok avoids both over-compression and wasteful redundancy. The work suggests that adaptive token allocation could become a standard technique for image compression and representation learning in computer vision and generative AI.
- TaTok introduces global tokens to model mutual information across patch tokens, solving information insufficiency in reconstruction.
- Dynamic Token Filtering (DTF) uses cumulative conditional entropy to eliminate redundant tokens, achieving an 8.7x inference speedup.
- The method improves gFID, an image-quality metric, by 1.3x over fixed-rate tokenization baselines.
Why It Matters
Adaptive tokenization reduces compute and storage costs while improving image quality, enabling more efficient vision models.