Research & Papers

Big Players Still Rely on Fixed-Patch ViTs for Vision-Language Models: Here's Why

Dynamic tokenization promises efficiency, but production systems stick with fixed patches due to pipeline constraints.

Deep Dive

A Reddit user wonders if major AI models use dynamic tokenization for vision instead of fixed patches, questioning whether the lack of adoption is due to marginal gains, fixed pipeline constraints, or unclear scaling laws—or if they are wrong and dynamic patching is already happening under the hood.

Key Points
  • Major production VLMs (GPT-4V, Gemini, Claude) still use fixed-patch ViTs despite research progress in dynamic tokenization.
  • Uniform token counts are required for pipeline efficiency, fixed memory allocation, and batch processing.
  • Scaling laws for input-adaptive patching remain unclear, discouraging large bets by risk-averse AI labs.

Why It Matters

The gap between research and production in vision tokenization highlights trade-offs between theoretical efficiency and practical system constraints.