Weight Decay Regimes in Grokking Transformers: Cheap Online Diagnostics
The paper studies weight decay regimes in transformers trained on modular arithmetic. It introduces two online diagnostics, computed from attention activations, that track training dynamics, and it identifies sharp transitions between memorization, generalization, and collapse. The findings come from experiments spanning eleven conditions and three model scales.
- Weight decay serves as a scalar empirical control parameter for different training regimes in transformers.
- Two online diagnostics are introduced to monitor training dynamics using attention activations.
- The study involved eleven experimental conditions and three model scales, revealing a specific boundary for memorization and developmental grokking.
arXiv:2605.20441 (cs), Computer Science > Machine Learning. Submitted on 19 May 2026. Author: Lucky Verma.

Abstract: Transformers trained on modular arithmetic exhibit sharp transitions between memorization, generalization, and collapse. We show that weight decay acts as a scalar empirical control parameter for these regimes, and introduce two cheap online diagnostics, mean pairwise attention-head cosine similarity and entropy standard deviation, that track training dynamics from attention activations alone and complement loss-landscape diagnostics at lower compute…
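The two diagnostics named in the abstract can be illustrated with a minimal NumPy sketch. The excerpt does not give the paper's exact definitions, so everything below is an assumption: attention is taken as an array of shape `(heads, queries, keys)` whose rows sum to 1, the cosine similarity is computed between flattened per-head attention maps, and the "entropy standard deviation" is read as the standard deviation across heads of each head's mean row entropy. The function names `mean_pairwise_head_cosine` and `head_entropy_std` are hypothetical.

```python
import numpy as np


def mean_pairwise_head_cosine(attn):
    """Mean cosine similarity over all distinct pairs of attention heads.

    attn: array of shape (heads, queries, keys). Assumed layout, not
    taken from the paper.
    """
    num_heads = attn.shape[0]
    if num_heads < 2:
        raise ValueError("need at least two heads to form a pair")
    # Flatten each head's attention map into a single vector.
    flat = attn.reshape(num_heads, -1)
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    unit = flat / np.clip(norms, 1e-12, None)
    # Full cosine-similarity matrix between heads.
    sim = unit @ unit.T
    # Keep only the strict upper triangle: each distinct pair once.
    upper = np.triu_indices(num_heads, k=1)
    return float(sim[upper].mean())


def head_entropy_std(attn, eps=1e-12):
    """Standard deviation across heads of the mean attention-row entropy.

    This reading of "entropy standard deviation" is an assumption.
    """
    # Shannon entropy (in nats) of each query's distribution over keys.
    row_entropy = -(attn * np.log(attn + eps)).sum(axis=-1)
    # Average over queries to get one entropy value per head.
    per_head = row_entropy.mean(axis=-1)
    return float(per_head.std())


# Toy input: one uniform head and one sharply peaked (one-hot) head.
uniform_head = np.full((4, 4), 0.25)
peaked_head = np.eye(4)
toy_attn = np.stack([uniform_head, peaked_head])

cosine = mean_pairwise_head_cosine(toy_attn)
entropy_spread = head_entropy_std(toy_attn)
```

Under these assumptions, both quantities need only the attention probabilities from a forward pass, which is consistent with the abstract's description of them as cheap and computed "from attention activations alone". A high mean cosine would suggest heads attending in similar patterns, while a large entropy spread would suggest heads differing in how sharply they focus.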