Developer Tools

Introducing AutoSP

Train 100k+ token models with zero manual code changes

Deep Dive

AutoSP, developed by the SSAIL Lab at UIUC in collaboration with Anyscale and Snowflake, tackles the growing challenge of training LLMs on extremely long contexts exceeding 100k tokens. Memory-partitioning techniques like ZeRO/FSDP shard model states but not activations, which grow with sequence length, so they often hit out-of-memory errors at these scales even with many GPUs. Sequence parallelism (SP) addresses this by partitioning the input tokens across devices, but implementing it by hand requires invasive changes to libraries like DeepSpeed or HuggingFace: partitioning tokens, inserting communication collectives, and overlapping compute with communication in both the forward and backward passes. AutoSP automates this entirely through a compiler-based approach within DeepSpeed's DeepCompile ecosystem.
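
To make the manual effort concrete, here is a minimal sketch (not AutoSP or DeepSpeed source) of the kind of hand-written sequence-parallel plumbing the compiler generates automatically; the helper names and the placement of the all-gather are illustrative assumptions.

```python
# Illustrative sketch only -- shows the sort of manual work AutoSP automates:
# splitting the token dimension across ranks and inserting a collective so
# attention still sees the full sequence.
import torch
import torch.distributed as dist

def shard_sequence(input_ids: torch.Tensor) -> torch.Tensor:
    """Keep only this rank's contiguous slice of the token dimension."""
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    seq_len = input_ids.shape[1]
    assert seq_len % world_size == 0, "sequence length must divide evenly"
    chunk = seq_len // world_size
    return input_ids[:, rank * chunk:(rank + 1) * chunk].contiguous()

def gather_for_attention(local_states: torch.Tensor) -> torch.Tensor:
    """All-gather hidden states along the token dimension before attention.

    A hand-written SP implementation has to place (and later reverse)
    collectives like this around every attention block, and wrap them in an
    autograd-aware function so gradients flow correctly in the backward pass.
    """
    world_size = dist.get_world_size()
    gathered = [torch.empty_like(local_states) for _ in range(world_size)]
    dist.all_gather(gathered, local_states)
    return torch.cat(gathered, dim=1)  # concatenate along the sequence axis
```

Multiply that by every attention layer, plus the scheduling needed to overlap these collectives with compute, and the appeal of having a compiler do it becomes clear.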

Users simply enable AutoSP by setting a few flags in their DeepSpeed config and using a utility function to tag inputs. The compiler handles all SP optimizations, including token partitioning, activation management, and communication scheduling, while interoperating with ZeRO stages 0/1. This makes long-context training accessible to any researcher without deep systems expertise. AutoSP is also performance-portable, generating efficient SP code across different hardware vendors. Benchmarks show it achieves runtime comparable to hand-optimized baselines while dramatically reducing engineering effort.
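
As a rough illustration of that workflow (the config keys and the tagging helper below are assumptions, not the documented AutoSP API), enabling it might look something like this:

```python
# Hypothetical sketch of the described workflow. The keys under "compile" and
# the tag_sequence_dim helper are illustrative assumptions -- consult the
# AutoSP/DeepCompile documentation for the real names.
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 1},           # AutoSP interoperates with ZeRO stages 0/1
    "compile": {
        "deepcompile": True,                     # enable the DeepCompile path
        "sequence_parallel": {"enabled": True},  # hypothetical AutoSP switch
    },
}

# `model` and `dataloader` are the user's existing single-device objects.
engine, optimizer, _, _ = deepspeed.initialize(model=model, config=ds_config)

for batch in dataloader:
    # Hypothetical tagging helper: tells the compiler which tensor dimension
    # holds tokens so it knows what to partition across GPUs.
    input_ids = tag_sequence_dim(batch["input_ids"], dim=1)
    loss = engine(input_ids, labels=batch["labels"]).loss
    engine.backward(loss)
    engine.step()
```

Everything else, from the model definition to the training loop, stays as written for single-device training, which is the point of the compiler-based approach.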

Key Points
  • AutoSP automatically converts single-device training code to multi-GPU sequence-parallel code for 100k+ token contexts
  • Integrated with DeepSpeed's DeepCompile compiler, requiring only config changes and input tagging
  • Compatible with ZeRO stages 0/1 and performance-portable across hardware vendors

Why It Matters

Democratizes long-context LLM research by eliminating weeks of manual systems engineering