The cut in the Mixture of Experts compute graph
The Mixture of Experts (MoE) architecture adds many more parameters with barely any extra compute. The catch is that the routing decision is discrete: a top-k over the router's softmax probabilities is a step function, so no gradient flows back through the selection to the router. Load balancing, capacity buffers and z-loss are all workarounds for this single cut in the compute graph.
- The MoE architecture allows for many more parameters with minimal additional compute requirements.
- A discrete routing decision results in a cut in the compute graph, preventing the router from receiving gradient signals.
- Workarounds are necessary to address the limitations of the routing mechanism and ensure effective training.
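The cut can be seen numerically with a small sketch. It is not the article's code: it assumes a softmax router with top-1 selection, and the logits, the helper names (`softmax`, `top1_mask`, `finite_diff`) and the step size are all hypothetical choices made for illustration. It compares a finite-difference derivative of the soft probabilities with that of the hard selection mask.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top1_mask(logits):
    """Hard routing decision: one-hot mask of the most probable expert."""
    probs = softmax(logits)
    winner = max(range(len(probs)), key=probs.__getitem__)
    return [1.0 if i == winner else 0.0 for i in range(len(probs))]


def finite_diff(fn, logits, i, j, eps=1e-4):
    """Central-difference estimate of d fn(logits)[j] / d logits[i]."""
    up = list(logits)
    up[i] += eps
    down = list(logits)
    down[i] -= eps
    return (fn(up)[j] - fn(down)[j]) / (2 * eps)


# Hypothetical router logits for a single token and three experts.
logits = [1.0, 0.5, -0.3]

# The soft probability responds smoothly to its logit: slope p0 * (1 - p0).
soft_grad = finite_diff(softmax, logits, 0, 0)

# The hard mask does not move under a small nudge, so its slope is zero.
hard_grad = finite_diff(top1_mask, logits, 0, 0)

print("d softmax[0] / d logit[0]:", soft_grad)
print("d mask[0]    / d logit[0]:", hard_grad)
```

The mask only changes when one logit overtakes another, and there it jumps from 0 to 1. Away from that crossing the slope is exactly zero, which is why the selection itself carries no training signal back to the router.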
Opening excerpt (first ~120 words)
Mixture of Experts looks like it's one of those few changes you can make to the architecture of a model that comes almost for free: many more parameters, barely any more compute. The forward pass is just a router, a softmax and a top-k. Then you train it and the loss won't move. The reason, and we should have known it was too good to be true, is a single cut in the compute graph. All the load balancing, capacity buffers and z-loss are workarounds for this one cut. This cut comes from the routing decision being discrete, which until you've thought about it might not raise any red flags. But a top-k over the softmax probabilities (an argmax in the k=1 case) is a step function with no gradient.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at idlemachines.