VDCook: DIY Video Data Cooking for Your MLLMs
Researchers can now 'cook' specialized video training data for MLLMs using simple text prompts and adjustable parameters.
Chengwei Wu's VDCook platform is a novel infrastructure-level solution for constructing specialized video training datasets for Multimodal Large Language Models (MLLMs). Unlike static, one-time-built datasets, VDCook functions as a configurable, self-evolving "data operating system." Researchers and domain teams initiate the process by submitting natural language queries (e.g., "videos of a chef preparing sushi") and adjusting parameters like desired data scale, the ratio of retrieved-to-synthesized content, and quality thresholds. The system then automatically optimizes the query and concurrently runs its core modules: real video retrieval from existing sources and controlled synthesis of new video content.
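For concreteness, a cook request might look something like the Python sketch below. The endpoint URL, client flow, and parameter names (`target_size`, `retrieval_synthesis_ratio`, `quality_threshold`) are illustrative assumptions, not VDCook's documented API.

```python
# Hypothetical sketch of submitting a VDCook "cook request".
# All names and the endpoint are assumptions for illustration only.
import json
import urllib.request

request = {
    "query": "videos of a chef preparing sushi",  # natural language query
    "target_size": 5_000,                         # desired number of clips
    "retrieval_synthesis_ratio": 0.7,             # 70% retrieved, 30% synthesized
    "quality_threshold": 0.85,                    # minimum per-clip quality score
}

# Placeholder URL -- stands in for whatever endpoint the platform exposes.
req = urllib.request.Request(
    "https://vdcook.example/api/v1/cook",
    data=json.dumps(request).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    job = json.load(resp)
print(f"Submitted cook job: {job}")
```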
The output is a comprehensive, in-domain data package complete with full provenance metadata and reproducible Jupyter Notebooks, enabling rigorous scientific validation. A key innovation is its automated data ingestion mechanism based on Anthropic's Model Context Protocol (MCP), which allows datasets to continuously update and expand into dynamically evolving ecosystems. The platform also automatically annotates videos with multi-dimensional metadata, including scene segmentation, motion scoring, OCR text detection, and automatic captioning. This rich annotation lays the groundwork for flexible downstream data manipulation ("cooking") and efficient indexing, directly supporting the training pipelines of modern video-language models. By automating the labor-intensive data curation process, VDCook aims to significantly accelerate research and development in specialized verticals like medicine, robotics, or education, where high-quality, domain-specific video data is scarce but critical.
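The per-clip record implied by that multi-dimensional annotation could be sketched as follows; the field names and value formats here are assumptions for illustration, not VDCook's actual schema.

```python
# A minimal, assumed sketch of one annotated clip's metadata, combining
# the provenance and annotation dimensions described above.
clip_annotation = {
    "clip_id": "sushi_0001",
    "source": "retrieved",                     # or "synthesized"
    "provenance": {                            # full provenance metadata
        "origin_url": "https://example.com/video/123",
        "license": "CC-BY-4.0",
        "retrieved_at": "2025-01-15T09:30:00Z",
    },
    "scenes": [                                # scene segmentation boundaries
        {"start_s": 0.0, "end_s": 4.2},
        {"start_s": 4.2, "end_s": 11.7},
    ],
    "motion_score": 0.62,                      # normalized motion intensity
    "ocr_text": ["FRESH TUNA"],                # detected on-screen text
    "caption": "A chef slices raw tuna on a wooden cutting board.",
}
```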
- Users build datasets via natural language queries and configurable parameters for scale, synthesis ratio, and quality.
- System automatically retrieves real videos and performs controlled synthesis, outputting complete packages with provenance and notebooks.
- Leverages Anthropic's Model Context Protocol (MCP) for continuous, automated updates, transforming static datasets into living ecosystems (see the MCP sketch after this list).
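As a rough illustration of the MCP-based ingestion idea, the sketch below exposes a dataset-update tool via the official MCP Python SDK (`pip install mcp`). The tool name, arguments, and placeholder ingestion logic are assumptions; VDCook's actual MCP integration is not detailed in this summary.

```python
# A minimal sketch of an MCP server exposing a dataset-update tool,
# using the official MCP Python SDK. The ingestion body is a stub.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("vdcook-ingest")

@mcp.tool()
def ingest_new_videos(query: str, max_clips: int = 100) -> str:
    """Fetch and annotate new clips matching `query`, then append them
    to the existing dataset (placeholder logic for illustration)."""
    # Real ingestion would run retrieval/synthesis and annotation here.
    return f"Queued up to {max_clips} new clips for query: {query!r}"

if __name__ == "__main__":
    mcp.run()  # serves the tool over stdio by default
```

In this pattern, any MCP-compatible client could periodically invoke the tool, which is one plausible way a static dataset becomes a continuously updating one.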
Why It Matters
Drastically reduces the time and expertise needed to create high-quality, specialized video datasets for training next-gen AI models.