Research & Papers

DODO: Discrete OCR Diffusion Models

Researchers introduce a block discrete diffusion VLM that processes entire documents at once, not token-by-token.

Deep Dive

Researchers from Tel Aviv University and NVIDIA introduce DODO, the first Vision-Language Model to use block discrete diffusion for Optical Character Recognition. Unlike autoregressive models that generate tokens sequentially, DODO decodes blocks of tokens in parallel, achieving near state-of-the-art accuracy with up to 3x faster inference. This enables rapid digitization of long documents, overcoming the computational bottleneck of traditional methods, which require a full forward pass for every generated token.
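To see why parallel block decoding helps, consider a toy cost model (a sketch for intuition, not the DODO implementation): an autoregressive decoder needs one forward pass per token, while a block diffusion decoder pays a fixed number of denoising steps per block, regardless of how many tokens the block contains. The block size and step count below are illustrative assumptions, not figures from the paper.

```python
def autoregressive_passes(n_tokens: int) -> int:
    # One model forward pass per generated token.
    return n_tokens


def block_diffusion_passes(n_tokens: int, block_size: int, denoise_steps: int) -> int:
    # All tokens in a block are denoised in parallel over a fixed
    # number of diffusion steps, so cost scales with blocks, not tokens.
    n_blocks = -(-n_tokens // block_size)  # ceiling division
    return n_blocks * denoise_steps


if __name__ == "__main__":
    n = 1024  # tokens on a long document page (illustrative)
    ar = autoregressive_passes(n)
    bd = block_diffusion_passes(n, block_size=64, denoise_steps=8)
    print(f"autoregressive: {ar} passes; block diffusion: {bd} passes "
          f"(~{ar / bd:.0f}x fewer)")
```

The actual speedup depends on block size, denoising steps, and per-pass cost, so the ratio here is purely illustrative; the paper reports up to 3x faster inference in practice.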

Why It Matters

Faster, cheaper OCR enables large-scale digitization of archives, legal documents, and historical texts previously too slow to process.