MoE inference optimizations: 15% lower expert load by request reordering
Doubleword has introduced a method for optimizing inference in Mixture-of-Experts (MoE) models by reordering inputs. The technique reduces expert loads, the number of expert weight sets that must be loaded per forward pass, by approximately 15%, which improves throughput without any changes to the model. It works by batching similar prompts together so that the requests in a batch share experts and fewer unique experts are loaded during inference.
- Reordering inputs in Mixture-of-Experts models can cut expert loads by about 15%.
- Using a greedy algorithm, prompts can be arranged into batches that minimize the number of unique experts needed.
- The trained model achieves significant savings in expert loads, even when evaluated on out-of-domain datasets.
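The greedy batching idea in the points above can be sketched as follows. This is a minimal illustration, not Doubleword's implementation: it assumes the set of experts each request will activate is already known or predicted, and the names `greedy_batches` and `unique_expert_loads`, the seeding rule, and the toy expert sets are all hypothetical.

```python
from typing import Dict, List, Set


def greedy_batches(experts: Dict[str, Set[int]], batch_size: int) -> List[List[str]]:
    """Group requests so that each batch touches few unique experts.

    `experts` maps a request ID to the (assumed known) set of expert IDs it uses.
    This is a greedy heuristic, so the result is not guaranteed to be optimal.
    """
    remaining = dict(experts)
    batches: List[List[str]] = []
    while remaining:
        # Seed the batch with the request needing the most experts
        # (ties resolve to the earliest request in insertion order).
        seed = max(remaining, key=lambda r: len(remaining[r]))
        loaded = set(remaining.pop(seed))
        batch = [seed]
        while remaining and len(batch) < batch_size:
            # Add the request that introduces the fewest not-yet-loaded experts.
            best = min(remaining, key=lambda r: len(remaining[r] - loaded))
            loaded |= remaining.pop(best)
            batch.append(best)
        batches.append(batch)
    return batches


def unique_expert_loads(batches: List[List[str]], experts: Dict[str, Set[int]]) -> int:
    """Total number of unique experts loaded, summed over all batches."""
    return sum(len(set().union(*(experts[r] for r in b))) for b in batches)


# Toy data (made up): four requests, batches of two.
toy = {"a": {0, 1}, "b": {6, 7}, "c": {0, 2}, "d": {6, 5}}
arrival_order = [["a", "b"], ["c", "d"]]
reordered = greedy_batches(toy, batch_size=2)

print(unique_expert_loads(arrival_order, toy))  # 8
print(reordered)                                # [['a', 'c'], ['b', 'd']]
print(unique_expert_loads(reordered, toy))      # 6
```

In this toy case, pairing requests with overlapping experts drops the total from 8 to 6 loads. The figures are illustrative only and say nothing about the roughly 15% saving reported for real workloads.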
Opening excerpt (first ~120 words)
May 15, 2026
MoE expert co-activations: Reordering inputs yields easy throughput gains.
Josh Cowan, Member of Technical Staff, Doubleword
Doubleword's batch inference offering keeps costs down by keeping throughput high, which is not easy given the architecture of popular Mixture-of-Experts models. While the sparse expert weights of MoE models make them quick to train, they also mean that at each layer of every forward pass, each request in a batch typically requires different expert weights to be loaded. This makes inference severely memory-bandwidth bound and cuts throughput relative to dense models. However, by reordering inputs so that similar prompts batch together, we can overlap the experts needed and reduce the number of unique experts loaded per forward pass.
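To make the quantity being reduced concrete, the sketch below counts expert loads for one batch as the number of distinct experts activated at each layer, summed over layers. It is an assumed metric for illustration: the `Routing` representation, the function name `expert_loads`, and the example routings are hypothetical and not taken from the article.

```python
from typing import List, Set

# One request's routing: a set of activated expert IDs for each MoE layer.
Routing = List[Set[int]]


def expert_loads(batch: List[Routing]) -> int:
    """Sum over layers of the number of distinct experts the batch activates."""
    if not batch:
        return 0
    n_layers = len(batch[0])
    return sum(
        len(set().union(*(request[layer] for request in batch)))
        for layer in range(n_layers)
    )


# Made-up routings for a two-layer model with two experts per token per layer.
x = [{0, 1}, {2, 3}]
y = [{0, 1}, {2, 4}]   # similar to x: shares most experts
z = [{5, 6}, {7, 8}]   # dissimilar to x: shares none

print(expert_loads([x, y]))  # 5
print(expert_loads([x, z]))  # 8
```

Batching `x` with the similar request `y` needs 5 expert loads, while batching it with the dissimilar `z` needs 8, which is the overlap effect the excerpt describes.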
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at Doubleword.