Research & Papers

Measure Transport Theory Boosts Visual Text Compression by 3.3%

A label-free routing rule matches the oracle on 17/24 NLP datasets.

Deep Dive

Visual text compression (VTC) offers a clever way to handle long contexts: render text as an image, then re-encode it with a vision-language model. This typically produces 3–20× fewer decoder tokens than standard subword tokenization. However, practitioners have struggled to predict when VTC actually improves downstream task performance: sometimes it matches or beats text-only baselines, and other times it fails entirely. The compression ratio alone doesn't explain the variance. What has been missing is a principled measure of the task-relevant information lost in the visual encoding step.

Now, a team led by Lv Tang addresses this gap by formulating VTC in the language of measure transport. They treat text and visual tokens as empirical probability measures and show that the ViT patch encoder induces a push-forward map from the text measure to the visual one. Its transport cost decomposes into a precision cost (from within-patch aggregation) and a coverage cost (from cross-patch fragmentation). This yields two practical tools: a label-free routing criterion that selects the better path (text or visual) for each input, and a foveation mechanism that re-encodes high-cost regions at higher resolution. Tested across 24 NLP datasets with Qwen3-4B, the label-free rule matches the per-dataset oracle on 17/24 datasets (70.8%), improving the average task score by 3.3% while using 10.3% fewer tokens than a text-only LLM baseline.
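To make the two tools concrete, here is a minimal Python sketch of how a cost-based router and a foveation selector could fit together. It is not the paper's method: the cost formulas, the `Patch` type, the `alpha` weight, and the `threshold` value are all hypothetical stand-ins chosen for illustration, since the summary above does not give the actual definitions.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import fmean

# NOTE: every definition here is an illustrative assumption, not the paper's formula.


@dataclass
class Patch:
    token_values: list[float]  # toy 1-D "embeddings" of the tokens inside the patch
    split_tokens: int          # hypothetical count of tokens cut by the patch boundary


def precision_cost(patch: Patch) -> float:
    """Toy within-patch aggregation loss: mean squared distance to the patch mean."""
    if not patch.token_values:
        return 0.0
    mu = fmean(patch.token_values)
    return fmean((v - mu) ** 2 for v in patch.token_values)


def coverage_cost(patch: Patch) -> float:
    """Toy cross-patch fragmentation loss: share of the patch's tokens that are split."""
    n = len(patch.token_values)
    return patch.split_tokens / n if n else 0.0


def patch_cost(patch: Patch, alpha: float = 1.0) -> float:
    """Total toy transport cost: precision plus alpha-weighted coverage."""
    return precision_cost(patch) + alpha * coverage_cost(patch)


def route(patches: list[Patch], threshold: float = 0.5, alpha: float = 1.0) -> str:
    """Label-free routing: use the visual path only when the mean cost is low."""
    if not patches:
        return "text"
    mean_cost = fmean(patch_cost(p, alpha) for p in patches)
    return "visual" if mean_cost <= threshold else "text"


def foveate(patches: list[Patch], budget: int = 1, alpha: float = 1.0) -> list[int]:
    """Return indices of the `budget` highest-cost patches to re-encode at higher resolution."""
    ranked = sorted(
        range(len(patches)),
        key=lambda i: patch_cost(patches[i], alpha),
        reverse=True,
    )
    return sorted(ranked[:budget])


# Two toy inputs: one that aggregates cleanly, one with a noisy, fragmented patch.
clean = [Patch([0.1, 0.1, 0.2], 0), Patch([0.5, 0.5], 0)]
noisy = [Patch([0.0, 2.0], 1), Patch([0.1, 0.1], 0)]

print(route(clean))       # low mean cost, so the visual path is chosen
print(route(noisy))       # high mean cost, so the input falls back to text
print(foveate(noisy, 1))  # the first patch carries the cost and is selected
```

The point of the sketch is that both decisions depend only on quantities computed from the input itself, with no task labels, which is what makes a rule of this kind label-free. In practice the threshold and weighting would have to come from the paper's analysis rather than the arbitrary values used here.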

Key Points
  • Formulates visual text compression as measure transport, decomposing information loss into precision and coverage costs.
  • Label-free routing criterion matches oracle selection on 17 of 24 NLP datasets (70.8%).
  • Achieves a +3.3% average task score improvement with 10.3% fewer tokens than a text-only LLM baseline on Qwen3-4B.

Why It Matters

A principled way to decide when to use visual tokens vs. text tokens, cutting costs without sacrificing accuracy.