ScrapeGraphAI-100k: A Large-Scale Dataset for LLM-Based Web Information Extraction
A new 93,695-example dataset drawn from real-world LLM usage narrows the gap between small and large models on structured data extraction.
Researchers William Brach, Francesco Zuppichini, Marco Vinciguerra, and Lorenzo Padoan built ScrapeGraphAI-100k, a large-scale dataset for training LLMs on web information extraction. It contains 93,695 real-world examples from ScrapeGraphAI telemetry, each with Markdown content, a prompt, a JSON schema, and the LLM's response. The dataset enables fine-tuning of small, efficient models: a 1.7B-parameter model trained on it performed nearly as well as a 30B-parameter baseline on structured extraction tasks.
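To make the four-part example structure concrete, here is a minimal Python sketch of what one record might look like and how a response could be checked against its schema. The field names (content, prompt, schema, response), the ExtractionExample class, and the conforms helper are all hypothetical and chosen for illustration; they are not the dataset's actual column names or the authors' evaluation code. The check covers only top-level "required" keys and primitive "type" values, not the full JSON Schema specification.

```python
import json
from dataclasses import dataclass


@dataclass
class ExtractionExample:
    """One record, using hypothetical field names (not the dataset's real columns)."""
    content: str   # page content as Markdown
    prompt: str    # natural-language extraction instruction
    schema: dict   # JSON schema describing the desired output
    response: str  # raw LLM output, expected to be JSON text


# Maps JSON Schema primitive type names to Python types.
_JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "array": list,
    "object": dict,
}


def conforms(example: ExtractionExample) -> bool:
    """Simplified check: top-level required keys and primitive types only."""
    try:
        data = json.loads(example.response)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    for key in example.schema.get("required", []):
        if key not in data:
            return False
    props = example.schema.get("properties", {})
    for key, value in data.items():
        expected = props.get(key, {}).get("type")
        if expected not in _JSON_TYPES:
            continue  # unknown or unspecified type: not checked in this sketch
        # bool is a subclass of int in Python, so reject it for numeric types.
        if isinstance(value, bool) and expected in ("integer", "number"):
            return False
        if not isinstance(value, _JSON_TYPES[expected]):
            return False
    return True


example = ExtractionExample(
    content="# Acme Widget\n\nPrice: $19.99",
    prompt="Extract the product name and price.",
    schema={
        "type": "object",
        "properties": {"name": {"type": "string"}, "price": {"type": "number"}},
        "required": ["name", "price"],
    },
    response='{"name": "Acme Widget", "price": 19.99}',
)
print(conforms(example))
```

A filter like this could be one way to separate schema-conforming responses from malformed ones before fine-tuning, though whether and how the authors filter their data is not stated here.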
Why It Matters
Small models fine-tuned on real extraction data could make AI agents that automate data collection from websites cheaper and faster to run, reducing reliance on massive, expensive models.