The Devil is in the Condition Numbers: Why is GLU Better than non-GLU Structure?
The paper examines why Gated Linear Units (GLU) outperform non-gated structures in machine learning models, analyzing two-layer networks in the neural tangent kernel (NTK) regime. It reports that GLU reshapes the NTK spectrum, which results in faster convergence during training. The findings suggest that GLU improves optimization speed but does not significantly reduce the generalization gap across various models.
- Gated Linear Units (GLU) outperform non-gated counterparts in large language models.
- The analysis reveals that GLU leads to a smaller condition number and a more compact eigenvalue distribution of the neural tangent kernel.
- GLU primarily accelerates optimization rather than reducing the generalization gap.
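To make the quantity in these points concrete, here is a minimal NumPy sketch that builds the empirical neural tangent kernel of a two-layer non-gated network and of a gated one, then reports each kernel's condition number (largest eigenvalue divided by smallest). It is an illustration under stated assumptions, not the paper's setup: the ReLU gate, the Gaussian initialization, the unit-norm random inputs, the sizes `n`, `d`, `m`, and all function names are hypothetical choices, and the usual width scaling is omitted because a uniform rescaling of the output leaves the condition number unchanged.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def jacobian_mlp(X, W, a):
    """Per-sample gradients of f(x) = a . relu(W x) with respect to (W, a)."""
    pre = X @ W.T                                  # (n, m) pre-activations
    active = (pre > 0).astype(X.dtype)             # derivative of relu
    dW = (active * a)[:, :, None] * X[:, None, :]  # (n, m, d)
    da = relu(pre)                                 # (n, m)
    return np.concatenate([dW.reshape(len(X), -1), da], axis=1)


def jacobian_glu(X, W, V, a):
    """Per-sample gradients of f(x) = a . (relu(W x) * (V x)) w.r.t. (W, V, a).

    Assumption: a ReLU-gated unit stands in for "GLU and its variants".
    """
    pre = X @ W.T
    gate = relu(pre)                               # gating branch
    active = (pre > 0).astype(X.dtype)
    lin = X @ V.T                                  # linear branch
    dW = (active * lin * a)[:, :, None] * X[:, None, :]
    dV = (gate * a)[:, :, None] * X[:, None, :]
    da = gate * lin
    n = len(X)
    return np.concatenate([dW.reshape(n, -1), dV.reshape(n, -1), da], axis=1)


def ntk_spectrum(J):
    """Empirical NTK Gram matrix K = J J^T, its eigenvalues, and its condition number."""
    K = J @ J.T
    eigvals = np.linalg.eigvalsh(K)                # ascending order
    return K, eigvals, eigvals[-1] / eigvals[0]


# Hypothetical toy problem: 16 unit-norm inputs in 8 dimensions, width 64.
rng = np.random.default_rng(0)
n, d, m = 16, 8, 64
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
W = rng.standard_normal((m, d))
V = rng.standard_normal((m, d))
a = rng.standard_normal(m)

K_mlp, eig_mlp, kappa_mlp = ntk_spectrum(jacobian_mlp(X, W, a))
K_glu, eig_glu, kappa_glu = ntk_spectrum(jacobian_glu(X, W, V, a))
print(f"non-gated: condition number = {kappa_mlp:.2f}")
print(f"gated:     condition number = {kappa_glu:.2f}")
```

A single random draw like this says nothing conclusive about which structure is better conditioned; a fairer comparison would average over seeds, widths, and data, and would match parameter counts between the two networks. The condition number is the quantity of interest because, for gradient descent on a squared loss with a fixed kernel, the slowest-converging direction is governed by the ratio of the smallest to the largest kernel eigenvalue.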
Opening excerpt (first ~120 words)
arXiv:2605.20749 (cs), Computer Science > Machine Learning
Submitted on 20 May 2026
Title: The Devil is in the Condition Numbers: Why is GLU Better than non-GLU Structure?
Authors: Xingyu Lyu, Qianqian Xu, Zhiyong Yang, Peisong Wen, Qingming Huang
Abstract: Gated Linear Units (GLU) and their variants are widely adopted in modern open-source large language model architectures and consistently outperform their non-gated counterparts, yet the underlying reasons for this advantage remain unclear. In this work, we study GLU by analyzing two-layer networks in the neural tangent kernel (NTK) regime.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.