
The cut in the Mixture of Experts compute graph

15 min read
#machine learning #artificial intelligence #neural networks
⚡ TL;DR · AI summary

The Mixture of Experts (MoE) architecture adds many more parameters to a model while adding barely any compute. The catch is the routing decision: it is discrete, and a top-k over the softmax probabilities is a step function with no gradient, so no learning signal reaches the router through its choice of experts. Load balancing, capacity buffers and z-loss are all workarounds for this single cut in the compute graph.

Original article: idlemachines
Opening excerpt (first ~120 words)

Mixture of Experts looks like it's one of those few changes you can make to the architecture of a model that comes almost for free: many more parameters, barely any more compute. The forward pass is just a router, a softmax and a top-k. Then you train it and the loss won't move. The reason, and we should have known it was too good to be true, is a single cut in the compute graph. All the load balancing, capacity buffers and z-loss are workarounds for this one cut. This cut comes from the routing decision being discrete, which until you've thought about it might not raise any red flags. But a top-k over the softmax probabilities (an argmax in the k=1 case) is a step function with no gradient.
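
To make the "step function with no gradient" point concrete, here is a minimal sketch in plain Python. It is not taken from the original article: the logits for four experts, the choice of k=2, and the helper names `softmax`, `top_k_mask` and `finite_diff` are all assumptions made for illustration. It nudges one router logit and compares how the softmax probabilities and the hard top-k selection respond.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_mask(probs, k):
    # Hard selection: 1.0 for the k largest probabilities, 0.0 elsewhere.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [1.0 if i in chosen else 0.0 for i in range(len(probs))]

def finite_diff(fn, logits, i, eps=1e-4):
    # Central finite difference of every output of fn with respect to logits[i].
    up = list(logits)
    up[i] += eps
    down = list(logits)
    down[i] -= eps
    return [(a - b) / (2 * eps) for a, b in zip(fn(up), fn(down))]

# Hypothetical router logits for 4 experts (illustrative values only).
logits = [2.0, 1.0, 0.5, -1.0]

# Softmax responds smoothly to a small change in the first logit.
soft_grad = finite_diff(softmax, logits, 0)

# The top-2 selection does not change at all under the same nudge.
hard_grad = finite_diff(lambda z: top_k_mask(softmax(z), k=2), logits, 0)

print("d softmax / d logit[0]:", soft_grad)
print("d top-k   / d logit[0]:", hard_grad)
```

Away from ties, the selection mask is piecewise constant, so its derivative is exactly zero, and at a tie it jumps. Either way the mask itself passes no useful gradient back to the router, which is the cut the excerpt describes.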

Excerpt limited to ~120 words for fair-use compliance. The full article is at idlemachines.
