The Load-Balance Problem Behind Hybrid Parallelism
The article examines load balancing in hybrid parallelism systems, focusing on the interaction between data parallelism (DP) and context parallelism (CP). It shows how variable sequence lengths turn partitioning into a scheduling problem, and it presents Megatron Dynamic-CP and ByteScale's Hybrid Data Parallelism as two approaches to handling the resulting imbalance.
- Variable sequence length turns hybrid parallelism into a scheduling problem.
- Megatron Dynamic-CP and ByteScale's Hybrid Data Parallelism offer ways to improve load balance.
- DP, CP, and pipeline parallelism (PP) interact, so they cannot be optimized independently.
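To make the last point concrete, here is a minimal sketch of why balancing at the DP level alone can fall short. It assumes attention cost grows with the square of sequence length (a rough proxy, not a measured cost model), and the function names and sequence lengths are hypothetical, chosen only for illustration.

```python
import heapq

def attention_cost(seq_len: int) -> int:
    # Assumed proxy: self-attention work grows roughly with seq_len squared.
    return seq_len * seq_len

def round_robin(seq_lens, num_ranks):
    # Naive DP assignment: sequence i goes to rank i mod num_ranks.
    loads = [0] * num_ranks
    for i, n in enumerate(seq_lens):
        loads[i % num_ranks] += attention_cost(n)
    return loads

def greedy_balance(seq_lens, num_ranks):
    # Longest-processing-time heuristic: heaviest sequence to the lightest rank.
    heap = [(0, r) for r in range(num_ranks)]
    heapq.heapify(heap)
    loads = [0] * num_ranks
    for n in sorted(seq_lens, reverse=True):
        load, r = heapq.heappop(heap)
        loads[r] = load + attention_cost(n)
        heapq.heappush(heap, (loads[r], r))
    return loads

def imbalance(loads):
    # Ratio of the slowest rank to the mean; 1.0 means perfectly balanced.
    return max(loads) / (sum(loads) / len(loads))

# Hypothetical packed batch with one long sequence and several short ones.
seq_lens = [8192, 512, 512, 512, 4096, 1024, 1024, 2048]
rr = round_robin(seq_lens, 4)
lpt = greedy_balance(seq_lens, 4)
print(f"round-robin imbalance: {imbalance(rr):.2f}")
print(f"greedy imbalance:      {imbalance(lpt):.2f}")
```

Under these assumptions the greedy placement helps, but the slowest rank is still bound by the single longest sequence. No DP-only assignment can go below that cost, which is where splitting a sequence across ranks (CP) enters the picture.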
Opening excerpt (first ~120 words)
TL;DR: variable sequence length turns hybrid parallelism into a scheduling problem. Megatron Dynamic-CP improves a fixed DPxCP pool by choosing CP size per sequence inside a packed batch, while ByteScale's Hybrid Data Parallelism goes further by scheduling a more flexible rank pool. The main lesson is that DP, CP, and PP cannot be optimized independently once load balance and communication are both in the loop.

1. The 5D Map Is Really a Coupling Map

- DP: Splits samples across ranks, then synchronizes gradients.
- CP: Splits one sequence across ranks, mainly to make long attention fit.
- TP + SP: Splits layer tensors and some sequence-dimension activations.
- PP: Splits layers into pipeline stages.
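The idea of choosing a CP size per sequence can be sketched as follows. This is not Megatron Dynamic-CP's actual algorithm, which the excerpt does not describe. It is a hypothetical heuristic that picks the smallest power-of-two CP size keeping each rank's token share under an assumed budget; `choose_cp_size`, `max_tokens_per_rank`, and `max_cp` are illustrative names, not real library parameters.

```python
def choose_cp_size(seq_len: int, max_tokens_per_rank: int, max_cp: int) -> int:
    # Double the CP size until each rank's share fits the budget,
    # or until the assumed upper bound max_cp is reached.
    cp = 1
    while cp < max_cp and seq_len / cp > max_tokens_per_rank:
        cp *= 2
    return cp

def plan_batch(seq_lens, max_tokens_per_rank=2048, max_cp=8):
    # One (sequence length, CP size) pair per sequence in the packed batch.
    return [(n, choose_cp_size(n, max_tokens_per_rank, max_cp)) for n in seq_lens]

# Hypothetical packed batch with mixed sequence lengths.
seq_lens = [8192, 512, 512, 512, 4096, 1024, 1024, 2048]
plan = plan_batch(seq_lens)
dynamic_slots = sum(cp for _, cp in plan)

# A single fixed CP size must be large enough for the longest sequence.
fixed_cp = max(cp for _, cp in plan)
fixed_slots = fixed_cp * len(seq_lens)

for n, cp in plan:
    print(f"seq_len={n:5d} -> cp={cp}")
print(f"rank slots: dynamic={dynamic_slots}, fixed={fixed_slots}")
```

In this toy setting, short sequences stay on one rank and avoid CP communication entirely, while only the long ones are split. A real scheduler would also have to account for communication cost and for how the chosen groups map onto the available rank pool.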
…
Excerpt limited to ~120 words for fair-use compliance. The full article is on GitHub.