SGOCR: A Spatially-Grounded OCR-focused Pipeline & V1 Dataset
Independent researcher builds multi-stage AI pipeline that generates high-quality visual question-answer data for training vision-language models.
Independent researcher Dreeseaw has released SGOCR, an open-source dataset pipeline that generates spatially-grounded, OCR-focused visual question-answer tuples for training vision-language models. The project addresses a gap in current visual datasets, which typically teach models to reason about text or scenes rather than simply to ground text within imagery. The multi-stage pipeline grew out of a two-week side project that began with prompting Qwen2.5-VL locally and evolved into a system that uses different AI models at each processing stage.
SGOCR employs Nvidia's nemotron-ocr-v2 for text extraction, Gemma4 with a Qwen3-VL fallback for anchor discovery and labeling, and Gemini 2.5 Flash as a teacher model, backed by simple grounding checks for verification. The researcher found that the smaller 2.5 Flash teacher was effective because the highly grounded annotations supplied in context let the model focus on semantics. Development itself ran as an agentic loop with a custom optimization approach based on Karpathy's autoresearch: a sweep-based process that supports more holistic observation and reduces the risk of good ideas being discarded prematurely.
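The article doesn't detail what the "simple grounding checks" look like, but a common approach is to verify that the teacher's answer text actually matches an OCR span and that the claimed bounding box overlaps that span's detected region. The sketch below illustrates that idea; all names (`OcrSpan`, `grounding_check`, the 0.5 IoU threshold) are assumptions for illustration, not SGOCR's actual API.

```python
from dataclasses import dataclass


@dataclass
class OcrSpan:
    """One OCR-detected text region: the text plus its pixel box."""
    text: str
    bbox: tuple  # (x0, y0, x1, y1) in pixels


def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0


def grounding_check(answer, answer_bbox, ocr_spans, min_iou=0.5):
    """Accept a QA tuple only if the answer text appears in an OCR span
    whose box sufficiently overlaps the claimed location."""
    for span in ocr_spans:
        if answer.lower() in span.text.lower() and iou(answer_bbox, span.bbox) >= min_iou:
            return True
    return False
```

A check like this is cheap relative to another model call, which is one reason a lightweight teacher plus deterministic verification can hold up quality.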
The pipeline includes a dataset-review frontend that stores human-provided grounding judgments for quality assessment, which were then bootstrapped into an automated quality-scoring system, so the quality of generated data can keep improving without constant manual review. The resulting dataset ships with rich metadata to support diverse VLM training strategies, giving researchers and developers a resource for building stronger text-grounding capabilities into vision-language models.
- Uses Nvidia's nemotron-ocr-v2 for text extraction and Gemma4/Qwen3-VL for anchor discovery
- Implements an agentic development loop with custom optimization based on Karpathy's autoresearch
- Generates spatially-grounded OCR training data with rich metadata for diverse VLM strategies
Why It Matters
Provides high-quality training data for improving text-grounding in vision-language models, addressing a critical gap in current AI training resources.