ScrapeGraphAI-100k dataset trains small 1.7B models to match 30B LLMs on web extraction

arXiv cs.IR February 18, 2026

⚡ A new 93,695-example dataset drawn from real-world LLM usage narrows the gap between small and large models on structured data extraction.

Deep Dive

Researchers William Brach, Francesco Zuppichini, Marco Vinciguerra, and Lorenzo Padoan built ScrapeGraphAI-100k, a large-scale dataset for training LLMs on web information extraction. It contains 93,695 real-world examples drawn from ScrapeGraphAI telemetry, each consisting of Markdown content, a prompt, a JSON schema, and the LLM's response. The dataset makes it practical to fine-tune small, efficient models: a 1.7B-parameter model trained on it performed nearly as well as a 30B-parameter baseline on structured extraction tasks.
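To make the four-part example format concrete, here is a minimal Python sketch of what one record might look like, plus a check that the response fits its schema. The field names (`content`, `prompt`, `schema`, `response`), the sample values, and the `conforms` helper are all assumptions for illustration; the dataset's actual column names and any validation the authors performed may differ. The checker covers only top-level `required` keys and `properties` types, not full JSON Schema.

```python
import json

# Hypothetical record layout: field names and values are illustrative,
# not the dataset's actual columns or contents.
record = {
    "content": "# Acme Widget\n\nPrice: $19.99\n\nIn stock: yes",
    "prompt": "Extract the product name and price.",
    "schema": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "price": {"type": "number"},
        },
        "required": ["name", "price"],
    },
    "response": '{"name": "Acme Widget", "price": 19.99}',
}

# Mapping from JSON Schema type names to Python types.
_JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "object": dict,
    "array": list,
}


def conforms(response_text, schema):
    """Check a response against top-level 'required' and 'properties' types only."""
    try:
        data = json.loads(response_text)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    # Every required key must be present.
    if any(key not in data for key in schema.get("required", [])):
        return False
    for key, spec in schema.get("properties", {}).items():
        if key not in data:
            continue
        type_name = spec.get("type")
        expected = _JSON_TYPES.get(type_name)
        if expected is None:
            continue  # unknown or missing type: not checked in this sketch
        value = data[key]
        # bool is a subclass of int in Python, so reject it for numeric types.
        if isinstance(value, bool) and type_name != "boolean":
            return False
        if not isinstance(value, expected):
            return False
    return True


print(conforms(record["response"], record["schema"]))
```

A check of this kind is one plausible way to score whether a fine-tuned model's output is structurally valid; for real use, a full JSON Schema validator would be the safer choice.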

Why It Matters

Small fine-tuned models that approach the accuracy of much larger ones make AI agents for automated web data collection cheaper and faster to run, reducing reliance on massive, expensive models.
