Generalization Dynamics of LM Pre-Training

Jiaxin Wen · May 20, 2026 · 2:10 AM UTC · 4 min read

#machine learning #language models #generalization

via GitHub

⚡ TL;DR · AI summary

The article examines how language models (LMs) generalize over the course of pre-training and reports a phenomenon it calls mode-hopping: LMs abruptly switch between parrot-like and intelligence-like modes, which challenges the common view that they learn gradually and stably. The findings suggest that generalization is shaped by competition between different circuits within the model, and that scaling alone does not eliminate mode-hopping.

Key facts

▪ The eval suite built for the study tracks generalization dynamics across LM pre-training.
▪ Mode-hopping occurs frequently: LMs switch suddenly between parrot-like and intelligence-like modes throughout pre-training.
▪ Scaling parameters can help mitigate competition between circuits, but does not completely resolve mode-hopping.

Original article

Opening excerpt (first ~120 words)

We build an eval suite that exposes such behavioral fingerprints for generalization (see Table 1 for details), and use it to track generalization dynamics across LM pre-training. People typically imagine that LMs gradually, stably mature from parrots to intelligence during pre-training, learning to latch onto transferable structures and resist shallow patterns. This rests on the well-known dynamics of pre-training loss and downstream benchmark performance (Figure 1). We find this mental model is wrong: throughout pre-training, LMs frequently and suddenly hop between parrot-like and intelligence-like modes, i.e. distinct algorithms implemented by distinct circuits. We call this mode-hopping.

…

Excerpt limited to ~120 words for fair-use compliance. The full article is on GitHub.
