The Routing and Filtering Structure of Attention
The paper analyzes the attention interaction matrix $QK^{\top}$ by splitting it into a skew-symmetric component (routing) and a symmetric component (filtering). It introduces $S$-$D$ attention, which separates routing from filtering and is reported to train stably without layer normalization. The findings suggest that this decomposition can lead to more efficient architectures with fewer parameters and improved performance.
- The attention interaction matrix contains both routing and filtering components.
- The study decomposes 1776 heads across five pretrained transformers to analyze routing and filtering.
- The proposed $S$-$D$ attention allows for stable training without layer normalization.
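
The routing/filtering split rests on a standard linear-algebra fact: any square matrix is the sum of a unique symmetric part and a unique skew-symmetric part. The sketch below illustrates that split on a toy single-head kernel. It is not the authors' code. It assumes the interaction matrix is taken to be the weight kernel `W_Q @ W_K.T`, and the names `split_interaction` and `numerical_rank`, the toy dimensions, and the rank tolerance are all hypothetical choices made for illustration.

```python
import numpy as np


def split_interaction(M):
    """Split a square matrix into its symmetric and skew-symmetric parts.

    Returns (sym, skew) with sym == sym.T, skew == -skew.T, sym + skew == M.
    """
    M = np.asarray(M, dtype=float)
    if M.ndim != 2 or M.shape[0] != M.shape[1]:
        raise ValueError("interaction matrix must be square")
    sym = 0.5 * (M + M.T)   # "filtering" component in the abstract's terms
    skew = 0.5 * (M - M.T)  # "routing" component in the abstract's terms
    return sym, skew


def numerical_rank(A, rel_tol=1e-8):
    """Count singular values above rel_tol times the largest one.

    The tolerance is an illustrative assumption, not a value from the paper.
    """
    s = np.linalg.svd(A, compute_uv=False)
    if s.size == 0 or s[0] == 0.0:
        return 0
    return int(np.sum(s > rel_tol * s[0]))


# Toy head: random weights stand in for a pretrained head (assumption).
rng = np.random.default_rng(0)
d_model, d_head = 16, 4
W_Q = rng.standard_normal((d_model, d_head))
W_K = rng.standard_normal((d_model, d_head))

M = W_Q @ W_K.T                  # d_model x d_model kernel, rank <= d_head
sym, skew = split_interaction(M)
routing_rank = numerical_rank(skew)
```

Because `M` has rank at most `d_head`, its skew-symmetric part has rank at most `2 * d_head`. That bound is one plausible reading of the "routing capacity allocated by the weight kernel" mentioned in the abstract, though the paper's exact definition may differ.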
Opening excerpt (first ~120 words)
arXiv:2605.18826 (cs), Computer Science > Machine Learning. Submitted on 12 May 2026.
Title: The Routing and Filtering Structure of Attention
Authors: Shafayeth Jamil, Rehan Kapadia
Abstract: The attention interaction matrix $QK^{\top}$ contains two entangled computations: a skew-symmetric component that redistributes information between positions (routing) and a symmetric component that scales mutual relevance (filtering). We decompose 1776 heads across five pretrained transformers and find routing operating at low rank, well below the routing capacity allocated by the weight kernel.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.