Usual implementation of attention transformers (SDPA) is kind of bad, actually

262588213843476 · May 18, 2026 · 4:23 AM UTC · 20 min read

#artificial intelligence #machine learning #technology

⚡ TL;DR · AI summary

The article critiques scaled dot-product attention (SDPA), the attention mechanism used in the standard transformer architecture, arguing that it may not be as effective as commonly believed. The author suggests that large AI companies promote expensive-to-run models because they hold a competitive advantage in that tier of the market. While not dismissing SDPA entirely, the piece questions whether it is necessary and hints that better alternatives may emerge.

Key facts

▪ The author believes that big AI companies shape the industry to favor expensive-to-run models, because those companies have a competitive advantage in that tier of the market.
▪ The article discusses the inefficiencies of scaled dot-product attention (SDPA), the attention mechanism in the standard transformer architecture.
▪ It places SDPA in the historical context of earlier machine learning models, including fully connected, recurrent, and convolutional networks.
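As a point of reference for the inefficiency claim, here is a minimal sketch of scaled dot-product attention, assuming the textbook single-head, unmasked formulation softmax(QKᵀ/√d)V. It is not taken from the article; the names `sdpa` and `score_count` are illustrative. The sketch shows where the quadratic cost comes from: every query row is scored against every key row.

```python
import math

def sdpa(q, k, v):
    """Textbook scaled dot-product attention on lists of row vectors.

    Assumptions (not from the article): single head, no mask, no batching.
    """
    d = len(q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for qi in q:
        # Score this query against every key: n keys for each of n queries,
        # which is the O(n^2) term in sequence length n.
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Output row is the attention-weighted sum of the value rows.
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out

def score_count(n):
    """Number of query-key scores computed for a sequence of length n."""
    return n * n
```

Doubling the sequence length quadruples the number of scores, so `score_count(8)` is four times `score_count(4)`.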

Original article

Gist · 262588213843476

Opening excerpt (first ~120 words)

Introduction

I was writing a note to a friend that mentioned my tedious opinions on “AI” discourse. It veered off into my usual argument that big “AI” companies are shaping the industry ecosystem to their own ends by setting up a situation where expensive-to-run models are overvalued. I think they’re doing this because they have a competitive advantage in that tier of the market, having bought (time on) a lot of GPUs. It’s like how a company that owns diamond mines will probably promote the idea that large, mined diamonds are important and valuable, and that there’s something off about running a sub-industrial mine or lab-growing diamonds. You can do this without lying at all, but I still dislike it. Large mined diamonds here are $O(n^2)$ models.

…

Excerpt limited to ~120 words for fair-use compliance. The full article is at Gist.
