
Weight Decay Regimes in Grokking Transformers: Cheap Online Diagnostics

⚡ TL;DR · AI summary

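The paper studies weight decay regimes in transformers trained on modular arithmetic, treating weight decay as a scalar empirical control parameter. It introduces two cheap online diagnostics, mean pairwise attention-head cosine similarity and entropy standard deviation, that track training dynamics from attention activations alone, and it identifies key transitions between memorization, generalization, and collapse. The findings are based on extensive experiments across various model scales and conditions.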

Original article: arXiv cs.AI
Opening excerpt (first ~120 words)

Computer Science > Machine Learning
arXiv:2605.20441 (cs)
[Submitted on 19 May 2026]
Title: Weight Decay Regimes in Grokking Transformers: Cheap Online Diagnostics
Authors: Lucky Verma
Abstract: Transformers trained on modular arithmetic exhibit sharp transitions between memorization, generalization, and collapse. We show that weight decay acts as a scalar empirical control parameter for these regimes, and introduce two cheap online diagnostics, mean pairwise attention-head cosine similarity and entropy standard deviation, that track training dynamics from attention activations alone and complement loss-landscape diagnostics at lower compute…

Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.
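
For readers who want a concrete picture of the two diagnostics named in the abstract, here is a minimal NumPy sketch. The excerpt does not give the paper's exact definitions, so everything below is an assumption: the function names `head_cosine_similarity` and `entropy_std` are hypothetical, the attention tensor is assumed to have shape (heads, queries, keys) with each row summing to 1, and "entropy standard deviation" is read here as the standard deviation across heads of each head's mean row entropy. The paper may define these differently.

```python
import numpy as np


def head_cosine_similarity(attn):
    """Mean pairwise cosine similarity between flattened attention heads.

    attn: array of shape (heads, queries, keys). Assumed layout, not
    taken from the paper.
    """
    heads = attn.shape[0]
    if heads < 2:
        raise ValueError("need at least two heads to form a pair")
    flat = attn.reshape(heads, -1)
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    unit = flat / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T
    # Keep only the strict upper triangle: each unordered pair once.
    rows, cols = np.triu_indices(heads, k=1)
    return float(sim[rows, cols].mean())


def entropy_std(attn):
    """Std across heads of the mean attention-row entropy (in nats).

    This aggregation order is an assumption made for illustration.
    """
    safe = np.clip(attn, 1e-12, 1.0)  # avoid log(0); 0 * log(eps) is 0
    row_entropy = -(attn * np.log(safe)).sum(axis=-1)  # (heads, queries)
    per_head = row_entropy.mean(axis=1)                # (heads,)
    return float(per_head.std())


# Toy input: one uniform head and one one-hot head, 2 queries, 4 keys.
uniform_head = np.full((2, 4), 0.25)
one_hot_head = np.zeros((2, 4))
one_hot_head[:, 0] = 1.0
toy_attn = np.stack([uniform_head, one_hot_head])

toy_similarity = head_cosine_similarity(toy_attn)
toy_entropy_std = entropy_std(toy_attn)
```

Both functions read only the attention weights, so under these assumptions they could be called on activations captured during an ordinary forward pass, which is consistent with the abstract's description of the diagnostics as cheap and computed from attention activations alone.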
