DODO: Discrete OCR Diffusion Models
Researchers introduce a block discrete diffusion VLM that processes entire documents at once, not token-by-token.
Researchers from Tel Aviv University and NVIDIA introduce DODO, the first Vision-Language Model using block discrete diffusion for Optical Character Recognition. Unlike autoregressive models that process tokens sequentially, DODO enables parallel decoding, achieving near state-of-the-art accuracy while enabling up to 3x faster inference. This allows for rapid digitization of long documents, overcoming the computational bottleneck of traditional methods that require a forward pass for every generated token.
Why It Matters
Faster, cheaper OCR enables large-scale digitization of archives, legal documents, and historical texts previously too slow to process.