Image & Video

ZACH-ViT: Regime-Dependent Inductive Bias in Compact Vision Transformers for Medical Imaging

New 0.25M-parameter Vision Transformer removes positional embeddings and the class token for medical imaging.

Deep Dive

A new research paper introduces ZACH-ViT (Zero-token Adaptive Compact Hierarchical Vision Transformer), a compact Vision Transformer specifically designed for medical imaging applications. The key innovation lies in removing two standard Vision Transformer components: positional embeddings and the [CLS] token, creating a permutation-invariant architecture through global average pooling over patch representations. This design addresses a critical limitation in medical imaging where spatial layout can be weakly informative or inconsistent across different imaging modalities and patient anatomies.
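The permutation-invariance claim follows directly from the pooling choice: if the image representation is the global average of the patch embeddings, reordering the patches cannot change it. A minimal NumPy sketch (not the authors' code) demonstrates this property of the readout:

```python
import numpy as np

# A ViT-style readout without positional embeddings or a [CLS] token:
# the image representation is the global average over patch embeddings.
# Because the mean is order-agnostic, shuffling the patches leaves the
# pooled representation unchanged.
rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 64))  # 16 patch embeddings, dim 64

pooled = patches.mean(axis=0)               # global average pooling
shuffled = rng.permutation(patches, axis=0) # reorder the patch sequence
pooled_shuffled = shuffled.mean(axis=0)

assert np.allclose(pooled, pooled_shuffled)  # permutation invariance
```

Note that the self-attention layers themselves are also permutation-equivariant once positional embeddings are removed, so the invariance holds end to end, not just at the pooling layer.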

The technical approach uses adaptive residual projections to maintain training stability while operating under strict parameter constraints. The model was evaluated across seven MedMNIST datasets using a rigorous few-shot protocol with only 50 samples per class. Despite having just 0.25 million parameters and no pretraining, ZACH-ViT demonstrated competitive performance, achieving its strongest results on BloodMNIST while remaining competitive with TransMIL on PathMNIST. The research reveals regime-dependent behavior where the architecture performs best on datasets with weaker anatomical priors, supporting the hypothesis that aligning architectural inductive bias with data structure matters more than universal benchmark dominance.
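The paper does not spell out the adaptive residual projections in this summary, but the usual pattern behind such stabilizers can be sketched: when a block changes feature width, the skip path is linearly projected to match, and a gate on the block branch starts near zero so the network begins close to an identity map. Everything here (the names `W_proj`, `alpha`, and the gating scheme) is an illustrative assumption, not the authors' implementation:

```python
import numpy as np

# Hypothetical "adaptive residual projection" (illustrative only):
# - W_proj maps the skip connection from d_in to d_out when widths differ
# - alpha is a learnable scalar gate on the block branch, initialized to 0
#   so the layer starts as a pure (projected) skip, which aids stability
#   when training small models from scratch.
rng = np.random.default_rng(1)
d_in, d_out = 64, 96
x = rng.normal(size=(16, d_in))                # 16 patch tokens
W_block = rng.normal(size=(d_in, d_out)) * 0.02  # stand-in for the block
W_proj = rng.normal(size=(d_in, d_out)) * 0.02   # skip-path projection
alpha = 0.0                                      # gate: near-identity start

out = x @ W_proj + alpha * (x @ W_block)
assert out.shape == (16, d_out)
assert np.allclose(out, x @ W_proj)  # alpha=0 -> output is the projected skip
```

With `alpha` at zero, gradients still flow through the skip projection while the block branch is learned gradually, which is one common way to keep very small Transformers trainable without pretraining.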

For practical deployment, ZACH-ViT maintains sub-second inference times, making it suitable for edge-deployed clinical systems and resource-constrained environments. The findings challenge conventional wisdom about Vision Transformer design and suggest that medical imaging AI may benefit from specialized architectures rather than adapting general-purpose models. The code and models are publicly available, potentially accelerating development of lightweight diagnostic tools for healthcare settings with limited computational resources.

Key Points
  • Removes positional embeddings and [CLS] token for permutation invariance in medical imaging
  • Achieves competitive performance with only 0.25M parameters and no pretraining
  • Demonstrates sub-second inference times suitable for resource-constrained clinical deployment

Why It Matters

Enables accurate medical image analysis on edge devices in clinics with limited computational resources.