Grokking as Structural Inference: Transformers Need Bayesian Lottery Tickets
The paper examines 'grokking' in Transformers: a model that has already memorized its training set keeps training for thousands of steps before it generalizes. It formalizes attention as an implicit Bayesian posterior over task dependencies and identifies two conditions necessary for generalization. The authors argue that the delay reflects a structural inference process, which a KL-based intervention can bypass.
- The paper formalizes attention in Transformers as an implicit Bayesian posterior over task dependencies.
- It identifies two conditions necessary for generalization: a Goldilocks bound on MLP capacity and a Bayesian structural condition.
- The authors explain delayed generalization as a result of delayed structural inference, which can be bypassed with a KL-based intervention.
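The summary does not give the paper's actual formalization or the form of its KL-based intervention, so the following is only a generic illustration of the two ideas, not the authors' method. It assumes a single attention row can be read as a posterior (softmax of log-likelihood scores plus a log-prior) and that a KL term measures how far that posterior is from a target dependency pattern. All names, scores, and the target distribution are hypothetical.

```python
import math

def softmax(scores):
    """Normalize raw scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_posterior(scores, log_prior=None):
    """Read one attention row as a posterior over key positions.

    Treating `scores` as log-likelihoods, Bayes' rule gives
    posterior proportional to exp(score + log_prior), i.e. a softmax.
    """
    if log_prior is None:
        log_prior = [0.0] * len(scores)  # uniform prior (assumption)
    return softmax([s + lp for s, lp in zip(scores, log_prior)])

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

# Hypothetical query attending over 4 key positions.
scores = [2.0, 0.5, 0.5, 0.1]
# Hypothetical target structure: the task depends only on position 1.
target = [0.0, 1.0, 0.0, 0.0]

posterior = attention_posterior(scores)
penalty = kl_divergence(target, posterior)

print("posterior:", [round(p, 3) for p in posterior])
print("KL(target || posterior):", round(penalty, 3))
```

In this toy setup the attention row puts most of its mass on position 0 while the hypothetical target depends on position 1, so the KL term is positive. A regularizer of this general shape would push attention toward the target structure; whether the paper's intervention works this way is not stated in the summary.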
arXiv:2605.15787 (cs), Computer Science > Machine Learning
Submitted on 15 May 2026
Title: Grokking as Structural Inference: Transformers Need Bayesian Lottery Tickets
Authors: Kai Hidajat, Solden Stoll, Joseph An
Abstract: Why does a Transformer that has memorized its training set wait thousands of steps before it generalizes? Existing accounts locate this delay in norm minimization, feature emergence, or the late discovery of sparse subnetworks.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.