MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding
New AI research from China recasts OCR as inverse rendering, enabling parallel decoding that is up to 3.2x faster.
A research team from China has published a paper titled "MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding." The work challenges the standard approach to Optical Character Recognition (OCR), which has long relied on autoregressive decoding: a sequential, left-to-right generation process that adds latency and lets early errors propagate through long documents. The researchers argue that this sequential nature is an artifact of how text is serialized, not an intrinsic property of the visual recognition task itself. Their solution, MinerU-Diffusion, reframes OCR as an inverse rendering problem and decodes with a parallel diffusion denoising process conditioned on the document image.
MinerU-Diffusion's technical core is a block-wise diffusion decoder paired with an uncertainty-driven curriculum learning strategy, a combination the authors say enables stable training and efficient inference on long sequences. This architecture lets the model generate text and layout information in parallel rather than token by token. In the paper's experiments, MinerU-Diffusion improves robustness and decodes up to 3.2x faster than state-of-the-art autoregressive baselines. Evaluations on the authors' proposed "Semantic Shuffle" benchmark also show that the model depends less on linguistic priors, which points to stronger purely visual OCR capability on complex elements such as tables and mathematical formulas.
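To make the idea of block-wise parallel decoding concrete, here is a minimal toy sketch in Python. It is not the paper's implementation: the function names, the confidence-based unmasking rule, and the stand-in denoiser (which simply returns the known target text with made-up confidences instead of looking at an image) are all assumptions for illustration. The point is only that several positions inside a block can be filled per model pass, so the number of passes can be smaller than the number of tokens.

```python
from typing import Callable, List, Optional, Tuple

MASK = None  # marks a position that has not been decoded yet
Prediction = Tuple[str, float]  # (predicted token, confidence)


def blockwise_parallel_decode(
    denoise: Callable[[List[Optional[str]]], List[Prediction]],
    length: int,
    block_size: int,
    steps_per_block: int,
) -> Tuple[List[Optional[str]], int]:
    """Fill a fully masked sequence block by block, several tokens per pass."""
    seq: List[Optional[str]] = [MASK] * length
    passes = 0
    for start in range(0, length, block_size):
        end = min(start + block_size, length)
        # Tokens to commit per pass (ceiling division), so a block
        # finishes in at most steps_per_block passes.
        per_step = -(-(end - start) // steps_per_block)
        while any(seq[i] is MASK for i in range(start, end)):
            preds = denoise(seq)  # one pass predicts every position at once
            passes += 1
            masked = [i for i in range(start, end) if seq[i] is MASK]
            # Commit the most confident masked positions first.
            masked.sort(key=lambda i: preds[i][1], reverse=True)
            for i in masked[:per_step]:
                seq[i] = preds[i][0]
    return seq, passes


def make_toy_denoiser(target: List[str]):
    """Hypothetical stand-in for a visually conditioned denoiser."""
    def denoise(seq: List[Optional[str]]) -> List[Prediction]:
        # A real model would condition on the image and on committed tokens;
        # this toy ignores seq and returns the target with fake confidences.
        return [(tok, 1.0 / (1 + (7 * i) % 5)) for i, tok in enumerate(target)]
    return denoise


target = "Total revenue rose 12 % in Q3 2024".split()
decoded, passes = blockwise_parallel_decode(
    make_toy_denoiser(target), len(target), block_size=4, steps_per_block=2
)
print(decoded == target, passes, len(target))
```

In this toy run the eight tokens are recovered in four passes, where a strictly token-by-token decoder would need eight. That ratio is a property of the chosen block size and step count, not a reproduction of the paper's reported speedup.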
- Replaces sequential autoregressive decoding with parallel diffusion denoising, achieving up to 3.2x faster inference.
- Frames document OCR as an inverse rendering problem, reducing error propagation in long, complex documents.
- Shows stronger visual recognition on the Semantic Shuffle benchmark, with less reliance on linguistic context.
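The summary above does not say how the Semantic Shuffle benchmark is constructed. Purely as an illustration of what "less reliance on linguistic context" could be tested against, the Python sketch below scrambles word order so that a recognizer can no longer guess the next word from the preceding ones. The helper `shuffle_words` and the sample sentence are hypothetical and are not taken from the paper.

```python
import random
from typing import List


def shuffle_words(lines: List[str], seed: int = 0) -> List[str]:
    """Return each line with its words in random order (hypothetical probe)."""
    rng = random.Random(seed)  # fixed seed keeps the probe reproducible
    shuffled = []
    for line in lines:
        words = line.split()
        rng.shuffle(words)  # same words and glyphs, but no usable sentence structure
        shuffled.append(" ".join(words))
    return shuffled


original = ["the quarterly report shows steady growth in revenue"]
scrambled = shuffle_words(original)
print(scrambled[0])
```

Rendering the scrambled lines to images and comparing recognition accuracy against the unscrambled versions would, under this assumed setup, indicate how much a model leans on language priors instead of on what is actually on the page.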
Why It Matters
If the results hold up, this approach could make large-scale processing of legal documents, scientific papers, and financial reports both faster and more accurate.