Less Data, Faster Training: repeating smaller datasets speeds up learning via sampling biases
A recent study examines when training on a smaller dataset, repeated more often, is preferable to training on a larger one. The authors report that repeating fewer samples can save compute during training compared to using a larger dataset. They attribute the effect to sampling biases, which can help optimization, especially in reasoning tasks.
- The study investigates the 'small-vs-large gap' in machine learning training.
- Repeating smaller datasets can save computational resources during training.
- The findings suggest that smaller datasets with more repetitions can be beneficial for optimization.
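To make the comparison concrete, here is a minimal sketch of what "the same compute on a small versus a large dataset" could look like, assuming compute is measured as a fixed number of optimizer steps at a fixed batch size. The helper names (`make_batches`, `average_repetitions`) and the dataset sizes are hypothetical and chosen for illustration; the paper's actual protocol is not described in the excerpt, and the sketch makes no claim about which setting trains faster.

```python
import random


def make_batches(dataset_size, num_steps, batch_size, seed=0):
    """Build num_steps batches of sample indices from a pool of dataset_size examples.

    Each epoch is a fresh shuffle of the whole pool; epochs are concatenated
    until the step budget is filled, so a small pool is revisited many times.
    """
    rng = random.Random(seed)
    needed = num_steps * batch_size
    stream = []
    while len(stream) < needed:
        epoch = list(range(dataset_size))
        rng.shuffle(epoch)
        stream.extend(epoch)
    stream = stream[:needed]
    return [stream[i * batch_size:(i + 1) * batch_size] for i in range(num_steps)]


def average_repetitions(batches, dataset_size):
    """Average number of times each example in the pool is visited."""
    total_draws = sum(len(batch) for batch in batches)
    return total_draws / dataset_size


# Same compute budget for both runs: 100 steps of 8 examples each (assumed values).
num_steps, batch_size = 100, 8
small = make_batches(dataset_size=64, num_steps=num_steps, batch_size=batch_size)
large = make_batches(dataset_size=800, num_steps=num_steps, batch_size=batch_size)

print(average_repetitions(small, 64))   # each small-pool example is revisited many times
print(average_repetitions(large, 800))  # each large-pool example is seen once
```

Both schedules feed the model the same number of examples, so any difference in how quickly the loss falls would come from which examples are sampled and how often, not from extra compute.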
Opening excerpt (first ~120 words)
arXiv:2605.20314 (cs), Computer Science > Machine Learning
Submitted on 19 May 2026
Title: Less Data, Faster Training: repeating smaller datasets speeds up learning via sampling biases
Authors: Jingwen Liu, Ezra Edelman, Surbhi Goel, Bingbin Liu
Abstract: This work investigates the "small-vs-large gap", where repeating on fewer samples can lead to compute saving during training compared to using a larger dataset. This is observed across algorithmic tasks, architectures and optimizers and cannot be explained using prior theory.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.