FCP: New context parallelism boosts attention MFU in LLM pre-training by up to 2.21x

arXiv cs.DC May 12, 2026

⚡FCP's block-level sharding handles variable sequence lengths for near-linear GPU scalability.

Deep Dive

A team of researchers led by Yilong Zhao and Xiaonan Nie (affiliated with UC Berkeley, NVIDIA, and others) has introduced FCP (Flexible Context Parallelism), a new method for scaling foundation model pre-training across many GPUs. The paper, accepted at MLSys 2026, addresses a key bottleneck: existing context parallelism (CP) techniques struggle with the highly variable sequence lengths found in training data. Traditional CP methods either over-shard short sequences (wasting compute and creating communication overhead) or process long and short sequences in separate groups, causing workload imbalance.

FCP breaks sequences into smaller blocks and shards these blocks across workers via flexible peer-to-peer communication, rather than a rigid ring-based topology. Because placement is decided per block, blocks from both short and long sequences can be bin-packed onto the available GPUs, which keeps compute utilization high and loads balanced. In tests on up to 256 NVIDIA GPUs, FCP achieved near-linear scaling and delivered 1.13x to 2.21x improvements in attention model FLOPs utilization (MFU). The approach is particularly relevant for training long-context foundation models, where sequence length variation is extreme.
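To make the block-level idea concrete, here is a minimal Python sketch of splitting variable-length sequences into blocks and packing them onto GPUs. It is not the paper's algorithm: the greedy heaviest-first heuristic, the cost model (query tokens times visible key/value tokens, a rough upper bound for causal attention), and the names `split_into_blocks` and `pack_blocks` are all assumptions made for illustration, and the sketch ignores the peer-to-peer communication cost that a real placement would also have to weigh.

```python
import heapq

def split_into_blocks(seq_lens, block_size):
    """Split each sequence into blocks of at most block_size tokens.

    Returns (seq_id, block_idx, cost) tuples. The cost is an assumed proxy
    for causal attention work: query tokens in the block times the number
    of key/value tokens visible up to the end of that block.
    """
    blocks = []
    for seq_id, length in enumerate(seq_lens):
        n_blocks = -(-length // block_size)  # ceiling division
        for b in range(n_blocks):
            q_tokens = min(block_size, length - b * block_size)
            kv_tokens = b * block_size + q_tokens
            blocks.append((seq_id, b, q_tokens * kv_tokens))
    return blocks

def pack_blocks(blocks, num_gpus):
    """Greedy packing (hypothetical heuristic): assign the heaviest
    remaining block to the currently least-loaded GPU."""
    heap = [(0, gpu) for gpu in range(num_gpus)]  # (load, gpu_id)
    heapq.heapify(heap)
    placement = {gpu: [] for gpu in range(num_gpus)}
    for block in sorted(blocks, key=lambda blk: blk[2], reverse=True):
        load, gpu = heapq.heappop(heap)
        placement[gpu].append(block)
        heapq.heappush(heap, (load + block[2], gpu))
    loads = [sum(blk[2] for blk in placement[g]) for g in range(num_gpus)]
    return placement, loads

if __name__ == "__main__":
    # One long sequence mixed with several short ones.
    seq_lens = [8192, 512, 512, 2048, 256]
    blocks = split_into_blocks(seq_lens, block_size=512)
    placement, loads = pack_blocks(blocks, num_gpus=4)
    for gpu, load in enumerate(loads):
        print(f"GPU {gpu}: {len(placement[gpu])} blocks, load {load}")
```

In this sketch the short sequences stay whole as single blocks instead of being sharded across every GPU, while the long sequence's blocks are spread out to even the load. Blocks of one sequence may land on non-adjacent GPUs, which is why such a placement needs flexible point-to-point transfers rather than a fixed ring.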

Key Points

FCP shards sequences at block-level granularity, enabling flexible placement across GPUs rather than relying on rigid ring communication.
Achieves near-linear scalability on up to 256 NVIDIA GPUs with 1.13x–2.21x improvement in attention MFU.
Bin-packs blocks from both short and long sequences to avoid over-sharding and workload imbalance common in existing CP methods.

Why It Matters

By raising attention MFU, FCP could make long-context foundation model training more efficient, which in turn may reduce GPU hours and make larger context windows more practical to train.
