Research & Papers

WORKSWORLD: A Domain for Integrated Numeric Planning and Scheduling of Distributed Pipelined Workflows

A new AI planning domain lets off-the-shelf numeric planners automatically build and schedule 14-step workflows spanning eight sites.

Deep Dive

Researchers Taylor Paul and William Regli have introduced WORKSWORLD, a novel AI planning framework designed to automate the complex task of building and scheduling distributed data pipelines. Published in the Proceedings of ICAPS 2026, the work addresses a critical bottleneck in data engineering: manually designing efficient workflows that process and move data across multiple computational sites. WORKSWORLD provides a formal domain for numeric, domain-independent planners, allowing them to solve a joint planning and scheduling problem. Users specify available data sources, processing components, network interfaces, and desired data destinations without needing to manually map out the entire workflow graph.
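To make the problem statement concrete, the sketch below shows the kind of declaration a user might supply: endpoints and resources only, with no workflow graph. It is written in Python purely for readability; the class and field names are illustrative assumptions, not the authors' encoding, which is a formal domain consumed by numeric, domain-independent planners.

  # Illustrative only: names and structure are assumptions, not the
  # paper's actual WORKSWORLD encoding (a formal planning domain).
  from dataclasses import dataclass, field

  @dataclass(frozen=True)
  class Component:
      name: str        # a required processing step, e.g. "decode"
      cpu_cost: float  # numeric resource demand used during scheduling

  @dataclass
  class Site:
      name: str
      cpu_capacity: float
      links: dict = field(default_factory=dict)  # neighbor site -> bandwidth

  # The user declares endpoints and resources; the planner, not the user,
  # decides which component runs where and in what order.
  sites = {
      "edge":    Site("edge",    cpu_capacity=4.0,  links={"core": 1.0}),
      "core":    Site("core",    cpu_capacity=32.0, links={"edge": 1.0, "archive": 10.0}),
      "archive": Site("archive", cpu_capacity=8.0,  links={"core": 10.0}),
  }
  goal = {
      "source":      ("edge", "sensor-feed"),    # where raw data enters
      "components":  [Component("decode", 1.0),  # required processing
                      Component("filter", 2.0)],
      "destination": ("archive", "data-lake"),   # where results must land
  }
  # A numeric planner is then asked for a joint plan: a placement of each
  # component on a site plus a schedule respecting capacities and links.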

Empirical results back the approach. Using a state-of-the-art numeric planner on commodity hardware, with a one-hour time limit and 30 GB of memory, the authors solved WORKSWORLD instances encoding linear-chain workflows of up to 14 distinct components distributed across eight sites. This is a significant step toward fully automated pipeline orchestration for permanent data ingest and processing systems. Because the domain's general graph representation captures both the data-processing nodes and the resource network used for scheduling, it can express data-flow optimization in complex, distributed computing environments where efficiency and resource allocation are paramount.
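For intuition about the output side, the following hypothetical shape pairs a component placement with a schedule of compute steps and network transfers; every name and number here is invented for illustration and does not come from the paper.

  # Hypothetical solved instance: placements plus a schedule. The planner
  # jointly chooses sites, ordering, and timing; this only shows the shape.
  placements = [
      # (component, site, start_time, duration), in a common time unit
      ("decode", "edge", 0.0, 2.0),
      ("filter", "core", 3.0, 4.0),  # waits for the edge -> core transfer
  ]
  transfers = [
      # (from_site, to_site, start_time, duration)
      ("edge", "core", 2.0, 1.0),
      ("core", "archive", 7.0, 0.5),
  ]

  def makespan(steps):
      # Completion time of the whole workflow, a natural quantity for a
      # numeric planner to minimize when scheduling.
      return max(start + dur for *_, start, dur in steps)

  print(makespan(placements + transfers))  # -> 7.5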

Key Points
  • Automates joint planning and scheduling for distributed data workflows, eliminating manual design.
  • Solved test workflows of up to 14 components across eight sites within a one-hour time limit and 30 GB of memory.
  • Lets users declare sources, resources, and destinations; the planner constructs the workflow graph and schedule itself.

Why It Matters

This automates a major engineering bottleneck, enabling more complex and efficient data pipelines for science and industry.