SGOCR: A Spatially-Grounded OCR-focused Pipeline & V1 Dataset
Independent researcher builds multi-stage AI pipeline that generates high-quality visual question-answer data for training vision-language models.
Independent researcher Dreeseaw has released SGOCR, an open-source dataset pipeline that generates spatially-grounded, OCR-focused visual question-answer tuples for training vision-language models. The project addresses a gap in current visual datasets, which typically teach models to reason about text or scenes rather than simply to ground text within imagery. The multi-stage pipeline grew out of a two-week side project that began with prompting Qwen2.5-VL locally and evolved into a system that uses different AI models at each processing stage.
SGOCR employs Nvidia's nemotron-ocr-v2 for text extraction, Gemma4 with a Qwen3-VL fallback for anchor discovery and labeling, and Gemini 2.5 Flash as a teacher model, backed by simple grounding checks for verification. The researcher found that the smaller 2.5 Flash teacher was effective because the highly grounded annotations supplied in context let the model focus on semantics. Development itself ran as an agentic loop with a custom optimization approach based on Karpathy's autoresearch: a sweep-based process that supports more holistic observation and reduces the risk of good ideas being discarded prematurely.
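The article doesn't detail what the "simple grounding checks" look like, but a common approach is to verify that the teacher's answer text actually matches an OCR span and that the claimed bounding box overlaps that span's detected region. The sketch below illustrates that idea; all names (`OcrSpan`, `grounding_check`, the 0.5 IoU threshold) are assumptions for illustration, not SGOCR's actual API.

```python
from dataclasses import dataclass


@dataclass
class OcrSpan:
    """One OCR-detected text region: the text plus its pixel box."""
    text: str
    bbox: tuple  # (x0, y0, x1, y1) in pixels


def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0


def grounding_check(answer, answer_bbox, ocr_spans, min_iou=0.5):
    """Accept a QA tuple only if the answer text appears in an OCR span
    whose box sufficiently overlaps the claimed location."""
    for span in ocr_spans:
        if answer.lower() in span.text.lower() and iou(answer_bbox, span.bbox) >= min_iou:
            return True
    return False
```

A check like this is cheap relative to another model call, which is one reason a lightweight teacher plus deterministic verification can hold up quality.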
The pipeline includes a dataset-review frontend that stores human-provided grounding judgments for quality assessment, which were then bootstrapped into an automated quality-scoring system, so the quality of generated data can keep improving without constant manual review. The resulting dataset ships with rich metadata to support diverse VLM training strategies, giving researchers and developers a resource for building stronger text-grounding capabilities into vision-language models.
- Uses Nvidia's nemotron-ocr-v2 for text extraction and Gemma4/Qwen3-VL for anchor discovery
- Implements an agentic development loop with custom optimization based on Karpathy's autoresearch
- Generates spatially-grounded OCR training data with rich metadata for diverse VLM strategies
Why It Matters
Provides high-quality training data for improving text-grounding in vision-language models, addressing a critical gap in current AI training resources.