Usual implementation of attention transformers (SDPA) is kind of bad, actually
The article critiques scaled dot-product attention (SDPA), the standard attention mechanism in transformer models, arguing that it may not be as effective as commonly believed. The author suggests that large AI companies promote expensive-to-run models because that is the tier of the market where they hold a competitive advantage. While not dismissing SDPA entirely, the piece questions whether it is necessary and hints that better alternatives may emerge.
- The author believes that big AI companies shape the industry to favor expensive-to-run models, because that is where their competitive advantage lies.
- The article discusses the inefficiencies of scaled dot-product attention (SDPA), the usual attention implementation in transformers.
- It places SDPA in the historical context of earlier machine learning models, including fully connected, recurrent, and convolutional networks.
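
To make the cost behind the inefficiency point concrete, here is a minimal sketch of single-head SDPA in plain Python. It is an illustration under stated assumptions, not the article's code: the function name `sdpa` and the returned score count are hypothetical helpers added for this example. The score matrix has one entry per (query, key) pair, which is where the quadratic growth in sequence length comes from.

```python
import math


def sdpa(q, k, v):
    """Single-head scaled dot-product attention on lists of lists.

    q, k: n vectors of dimension d; v: n vectors of any dimension.
    Returns (outputs, number_of_scores). Illustrative sketch only.
    """
    d = len(q[0])
    # One score per (query, key) pair: n * n entries for n tokens.
    scores = [
        [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        for qi in q
    ]
    outputs = []
    for row in scores:
        # Numerically stable softmax over the keys for this query.
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted average of the value vectors.
        dim_v = len(v[0])
        outputs.append(
            [sum(w * vj[t] for w, vj in zip(weights, v)) for t in range(dim_v)]
        )
    return outputs, len(scores) * len(scores[0])
```

Doubling the number of tokens quadruples the number of scores, regardless of how small the vectors are.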
Opening excerpt (first ~120 words)
Introduction

I was writing a note to a friend that mentioned my tedious opinions on “AI” discourse. It veered off into my usual argument that big “AI” companies are shaping the industry ecosystem to their own ends by setting up a situation where expensive-to-run models are overvalued. I think they’re doing this because they have a competitive advantage in that tier of the market, having bought (time on) a lot of GPUs. It’s like how a company that owns diamond mines will probably promote the idea that large, mined diamonds are important and valuable, and that there’s something off about running a sub-industrial mine or lab-growing diamonds. You can do this without lying at all, but I still dislike it. Large mined diamonds here are $O(n^2)$ models.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at Gist.