Evolutionary Data Making – How to train embedding models
The article describes a new method for acquiring training data for the embedding models that power search systems. Applying evolutionary principles, the authors built a search system in which frontier LLMs explore, grade, and refine data generation policies. A small model trained on the resulting data retrieved markedly better results, which shows how much the data generation process matters in machine learning.
- The new data acquisition method is inspired by evolutionary principles.
- A 0.6B parameter model trained on this data improved NDCG@10 by 37%.
- Leading providers have largely not shared their methodology for generating high-quality retrieval training data.
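The explore, grade, and refine cycle described above can be illustrated with a generic evolutionary loop. This is a minimal sketch under stated assumptions, not the article's implementation: `evolve`, `toy_mutate`, and `toy_grade` are hypothetical names, a policy is reduced to a single number, and simple Python functions stand in for the frontier LLM calls that would propose and grade real data generation policies.

```python
import random
from typing import Callable, List, Tuple, TypeVar

P = TypeVar("P")  # a "policy"; hypothetical stand-in for a data generation policy


def evolve(
    seeds: List[P],
    mutate: Callable[[P, random.Random], P],
    grade: Callable[[P], float],
    generations: int = 10,
    population: int = 8,
    survivors: int = 2,
    seed: int = 0,
) -> Tuple[P, float]:
    """Generic mutate/score/select loop; returns the best policy and its score."""
    rng = random.Random(seed)
    pool = list(seeds)
    for _ in range(generations):
        # Explore: propose variants of the current policies.
        children = [mutate(rng.choice(pool), rng) for _ in range(population)]
        # Grade: score parents and children together so a good parent is never lost.
        ranked = sorted(pool + children, key=grade, reverse=True)
        # Refine: keep only the top-scoring policies for the next round.
        pool = ranked[:survivors]
    best = pool[0]
    return best, grade(best)


# Toy stand-ins (assumptions): a policy is one number in [0, 1], and the
# grader prefers values near 0.7. In the article, LLMs play both roles.
def toy_mutate(policy: float, rng: random.Random) -> float:
    return min(1.0, max(0.0, policy + rng.gauss(0.0, 0.1)))


def toy_grade(policy: float) -> float:
    return -abs(policy - 0.7)


best_policy, best_score = evolve([0.1], toy_mutate, toy_grade, generations=30)
print(best_policy, best_score)
```

Because parents compete with their children in every round, the best score can never get worse from one generation to the next. Whether the article's system uses this kind of elitist selection is not stated in the excerpt.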
Opening excerpt (first ~120 words)
Evolutionary Data Making
Chris Gresla, 2026-03-17
TLDR: We needed a better way to acquire training data for the embedding model that powers search on our phone OS. Static data generation methods rely on heuristics to pair queries with relevant documents, capturing obvious associations but failing to scale or find nuanced data. Inspired by the principles of evolution, we built a search system where frontier LLMs explore, grade, and refine data generation policies, guided by a constitution of quality principles we call The Good Data Manifesto. A 0.6B parameter model trained on this data improved NDCG@10 by 37% and won or tied 82% of blind head-to-head comparisons on real user queries.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at Wafer.
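For readers unfamiliar with the metric quoted in the excerpt, NDCG@10 scores a ranked result list by summing the graded relevance of the top 10 results, discounted by rank, and dividing by the score of the ideal ordering. The sketch below uses the linear-gain variant; another common variant uses a gain of 2^rel - 1, and the excerpt does not say which one the authors used. The function names and the relevance labels in the example are illustrative assumptions.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k relevances, given in ranked order."""
    # Rank 1 gets discount log2(2) = 1, rank 2 gets log2(3), and so on.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG of the given ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Illustrative labels only: 2 = highly relevant, 1 = relevant, 0 = not relevant.
weak_ranking = [0, 1, 0, 2, 0]    # best document buried at rank 4
strong_ranking = [2, 1, 0, 0, 0]  # same documents, ideal order

print(ndcg_at_k(weak_ranking), ndcg_at_k(strong_ranking))
```

A ranking that places the most relevant documents first scores 1.0, and the score drops as relevant documents move down the list. In practice the per-query scores are averaged over an evaluation set of queries.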