The Routing and Filtering Structure of Attention

May 20, 2026 · 4:00 AM UTC · 3 min read

#machine learning #artificial intelligence #attention mechanisms

TL;DR · WeSearch summary

The paper treats the attention interaction matrix $QK^{\top}$ as two entangled computations: routing, which redistributes information between positions, and filtering, which scales mutual relevance. It introduces $S$-$D$ attention, which separates routing from filtering and allows for more stable training. The findings suggest that this decomposition can lead to more efficient architectures with fewer parameters and improved performance.

Key facts

▪ The attention interaction matrix contains both routing and filtering components.
▪ The study decomposes 1776 heads across five pretrained transformers to analyze routing and filtering.
▪ The proposed $S$-$D$ attention allows for stable training without layer normalization.

Original article

arXiv cs.AI


Opening excerpt (first ~120 words)

arXiv:2605.18826 (cs) · Submitted on 12 May 2026

Title: The Routing and Filtering Structure of Attention

Authors: Shafayeth Jamil, Rehan Kapadia

Abstract: The attention interaction matrix $QK^{\top}$ contains two entangled computations: a skew-symmetric component that redistributes information between positions (routing) and a symmetric component that scales mutual relevance (filtering). We decompose 1776 heads across five pretrained transformers and find routing operating at low rank, well below the routing capacity allocated by the weight kernel.

…

Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.
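
The symmetric/skew-symmetric split described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' code: the excerpt does not show their procedure, so everything here is an assumption made for illustration, including the helper names `split_kernel` and `effective_rank`, the per-head kernel $M = W_Q W_K^{\top}$, the random weights, and the 99% spectral-energy cutoff used to count rank.

```python
import numpy as np


def split_kernel(w_q, w_k):
    """Split the per-head kernel M = W_Q @ W_K.T into two parts.

    Hypothetical helper. `sym` is the symmetric part (called filtering
    in the abstract) and `skew` is the skew-symmetric part (routing).
    """
    m = w_q @ w_k.T
    sym = 0.5 * (m + m.T)
    skew = 0.5 * (m - m.T)
    return m, sym, skew


def effective_rank(a, energy=0.99):
    """Count singular values needed to reach `energy` of the squared spectrum.

    The 0.99 threshold is an assumed choice, not taken from the paper.
    """
    s = np.linalg.svd(a, compute_uv=False)
    total = np.sum(s ** 2)
    if total == 0.0:
        return 0
    cumulative = np.cumsum(s ** 2) / total
    # First index where the cumulative energy reaches the threshold.
    k = int(np.searchsorted(cumulative, energy)) + 1
    return min(k, len(s))


# Toy head with random weights (assumed sizes, for illustration only).
rng = np.random.default_rng(0)
d_model, d_head, n_tokens = 16, 4, 6
w_q = rng.normal(size=(d_model, d_head))
w_k = rng.normal(size=(d_model, d_head))
m, sym, skew = split_kernel(w_q, w_k)

# The two parts sum back to the kernel and have the stated symmetries.
assert np.allclose(sym + skew, m)
assert np.allclose(sym, sym.T)
assert np.allclose(skew, -skew.T)

# M has rank at most d_head, so each part has rank at most 2 * d_head.
print("rank bound:", 2 * d_head)
print("skew effective rank:", effective_rank(skew))
print("sym effective rank:", effective_rank(sym))

# The split carries over to token scores: X @ skew @ X.T is skew-symmetric.
x = rng.normal(size=(n_tokens, d_model))
routing_scores = x @ skew @ x.T
assert np.allclose(routing_scores, -routing_scores.T)
```

Since $M$ has rank at most `d_head`, the skew-symmetric part $\tfrac{1}{2}(M - M^{\top})$ has rank at most `2 * d_head`. This bound is one plausible reading of the "routing capacity allocated by the weight kernel", though the excerpt does not define that term.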
