DynaFlow boosts ML throughput up to 1.29x with programmable operator scheduling
New framework decouples model logic from execution, slashing engineering costs.
Intra-device parallelism is a powerful but rarely adopted technique for improving GPU utilization in ML inference and training. The core idea is to overlap operators that use different resources (e.g., compute vs. memory), which can significantly boost throughput. Existing frameworks, however, force developers to rewrite model code with invasive, model-specific optimizations. The engineering burden is heavy and keeps growing, because the optimal strategy shifts with workload, hardware, and model architecture.
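To make the overlap idea concrete, here is a toy cost model in Python. It is only a sketch under idealized assumptions: the operator names, the millisecond figures, and the rule that operators on different resources overlap perfectly are all hypothetical and do not come from DynaFlow or its evaluation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    resource: str  # "compute" or "memory": the resource this operator saturates
    ms: float      # hypothetical standalone runtime in milliseconds


def sequential_ms(ops):
    """Cost of running every operator back to back."""
    return sum(op.ms for op in ops)


def overlapped_ms(stream_a, stream_b):
    """Cost of running two operator streams side by side, one pair at a time.

    Idealized assumption: a pair that stresses different resources overlaps
    perfectly (cost = the slower op), while a pair that contends for the same
    resource is serialized (cost = the sum).
    """
    total = 0.0
    for a, b in zip(stream_a, stream_b):
        if a.resource != b.resource:
            total += max(a.ms, b.ms)
        else:
            total += a.ms + b.ms
    return total


# Two micro-batches in a repeating layer sequence, staggered by one operator
# so that a compute-bound op always meets a memory-bound one.
batch_0 = [Op("mlp_matmul", "compute", 4.0), Op("attention_decode", "memory", 3.0)]
batch_1 = [Op("attention_decode", "memory", 3.0), Op("mlp_matmul", "compute", 4.0)]

baseline = sequential_ms(batch_0 + batch_1)
overlapped = overlapped_ms(batch_0, batch_1)
speedup = baseline / overlapped
print(f"sequential: {baseline} ms, overlapped: {overlapped} ms, speedup: {speedup:.2f}x")
```

In this idealized model the staggered schedule finishes in 8 ms instead of 14 ms. Real operators contend for shared hardware resources, so measured gains are far smaller than this upper bound, which is consistent with the "up to 1.29x" figure reported for DynaFlow.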
DynaFlow, presented at MLSys 2026 by a team including Yi Pan and Baris Kasikci, addresses this problem by separating what a model computes from how it is scheduled. Its frontend combines simple graph-partitioning annotations with a programmable interface for defining custom parallelism strategies, decoupling the logical model from the physical execution schedule. The backend handles asynchronous control and data flow, uses custom memory management to eliminate copy overheads, and retains compatibility with standard optimizations like CUDA Graphs and TorchInductor. In tests across six modern ML systems, DynaFlow achieved up to a 1.29x throughput improvement with minimal code changes, making intra-device parallelism practical for production use. The framework is open-source on GitHub.
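The decoupling described above can be illustrated with a small Python sketch. This is not DynaFlow's API: `MODEL_STAGES`, `sequential`, `staggered`, and `schedule` are hypothetical names chosen for illustration. The point is only that the list of partitioned stages (the logical model) stays fixed while the scheduling strategy is swapped as a plain function.

```python
from typing import Callable, List, Tuple

# One schedule step is a tuple of (micro_batch, stage) entries meant to run concurrently.
Step = Tuple[Tuple[int, str], ...]
Strategy = Callable[[List[str], int], List[Step]]

# Logical model: an ordered list of partitioned subgraphs (hypothetical labels).
MODEL_STAGES = ["attention", "mlp"]


def sequential(stages: List[str], n_batches: int) -> List[Step]:
    """Baseline strategy: one stage of one micro-batch per step."""
    return [((b, s),) for b in range(n_batches) for s in stages]


def staggered(stages: List[str], n_batches: int) -> List[Step]:
    """Offset micro-batch b by b steps so that different stages co-run."""
    steps = []
    for t in range(len(stages) + n_batches - 1):
        step = tuple(
            (b, stages[t - b])
            for b in range(n_batches)
            if 0 <= t - b < len(stages)
        )
        steps.append(step)
    return steps


def schedule(stages: List[str], n_batches: int, strategy: Strategy) -> List[Step]:
    """Produce a physical schedule without touching the model definition."""
    return strategy(stages, n_batches)


baseline_plan = schedule(MODEL_STAGES, 2, sequential)
overlap_plan = schedule(MODEL_STAGES, 2, staggered)
for i, step in enumerate(overlap_plan):
    print(i, step)
```

Both plans execute every (micro-batch, stage) pair exactly once and preserve stage order within each micro-batch; only the grouping into concurrent steps differs. A real backend would additionally need to map each step onto streams, manage buffers, and synchronize between steps, which this sketch leaves out.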
- Introduces a programmable interface for custom intra-device parallelism strategies, reducing the need for invasive code rewrites.
- Achieves up to a 1.29x throughput improvement across six state-of-the-art ML systems with minimal model-specific changes.
- Preserves compatibility with existing optimizations like CUDA Graphs and TorchInductor, and uses custom memory management to eliminate copy overheads.
Why It Matters
DynaFlow makes intra-device parallelism practical for real-world ML systems, cutting engineering costs and boosting throughput without model-specific rewrites.