Every Component is a Lookup: Token Attribution and Composition from a Single Decomposition
The paper introduces Unpack, a method for mechanistic interpretability of transformers. From a single decomposition it derives both per-token attribution and the interaction strengths between components, which describe how those components compose. The method is evaluated on the indirect object identification (IOI) task across several model sizes.
- Unpack is a backward recursion that decomposes credit through transformer sublayers.
- It produces interaction strengths and per-token attribution without requiring gradients or auxiliary training.
- It consistently tracks mechanistic structure across different scales of the Pythia model family.
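The abstract states that attention and MLP both follow a shared key-value template $\phi(S)U$. The sketch below is one way to read that template, not the paper's Unpack recursion: it assumes `S` is a vector of scores, `phi` is softmax for attention or an elementwise nonlinearity for an MLP (ReLU is used here for simplicity), and `U` is a matrix of value rows. All names and numbers are hypothetical. It shows only the single-layer fact that such a lookup is a weighted sum of rows, so credit along any readout direction splits exactly across slots (source tokens or hidden neurons) without gradients.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def relu(scores):
    """Elementwise ReLU, standing in for an MLP nonlinearity (assumption)."""
    return [max(0.0, s) for s in scores]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def lookup(scores, phi, values):
    """Key-value lookup phi(S) U: weight each value row by its activated score.

    Returns the output vector and the per-slot contributions that sum to it.
    """
    weights = phi(scores)
    contributions = [[w * v for v in row] for w, row in zip(weights, values)]
    output = [sum(col) for col in zip(*contributions)]
    return output, contributions

def credit(contributions, direction):
    """Project each slot's contribution onto a readout direction."""
    return [dot(c, direction) for c in contributions]

# Attention-style lookup: slots are source tokens, phi is softmax.
attn_scores = [2.0, 0.0, -1.0]                        # hypothetical query-key scores
attn_values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # hypothetical value rows
attn_out, attn_parts = lookup(attn_scores, softmax, attn_values)

# MLP-style lookup: slots are hidden neurons, phi is elementwise.
mlp_scores = [0.5, -0.3, 1.2]                         # hypothetical pre-activations
mlp_values = [[0.2, -0.1], [0.7, 0.4], [-0.5, 0.3]]   # hypothetical output-weight rows
mlp_out, mlp_parts = lookup(mlp_scores, relu, mlp_values)

direction = [1.0, -1.0]                               # hypothetical readout direction
attn_credit = credit(attn_parts, direction)           # one credit value per token
mlp_credit = credit(mlp_parts, direction)             # one credit value per neuron
```

In both cases the per-slot credits sum to the projection of the full output onto `direction`, which is what makes an additive, gradient-free attribution possible for a single lookup. How the paper chains this through sublayers is not covered by the excerpt and is not reproduced here.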
Opening excerpt (first ~120 words)
arXiv:2605.23393 (cs), Computer Science > Machine Learning
Submitted on 22 May 2026
Title: Every Component is a Lookup: Token Attribution and Composition from a Single Decomposition
Authors: Po-Kai Chen, Niki van Stein, Aske Plaat
Abstract: Mechanistic interpretability of transformers requires identifying not just which components matter but how they compose into the computational route that produced a prediction. Both attention and MLP follow a shared key-value template $\phi(S)U$.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.