WeSearch

Evolutionary Data Making – How to train embedding models

⚡ TL;DR · AI summary

The article describes a new method for acquiring training data for the embedding model behind a search system. Applying evolutionary principles, the authors built a system in which frontier LLMs explore, grade, and refine data generation policies. A 0.6B parameter model trained on the resulting data improved NDCG@10 by 37% and won or tied 82% of blind head-to-head comparisons on real user queries, which underlines how much the data generation process matters in machine learning.

Original article
Wafer
Opening excerpt (first ~120 words)

Evolutionary Data Making
Chris Gresla, 2026-03-17

TLDR

We needed a better way to acquire training data for the embedding model that powers search on our phone OS. Static data generation methods rely on heuristics to pair queries with relevant documents, capturing obvious associations but failing to scale or find nuanced data. Inspired by the principles of evolution, we built a search system where frontier LLMs explore, grade, and refine data generation policies, guided by a constitution of quality principles we call The Good Data Manifesto. A 0.6B parameter model trained on this data improved NDCG@10 by 37% and won or tied 82% of blind head-to-head comparisons on real user queries.
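For readers unfamiliar with the metric quoted above, here is a minimal sketch of how NDCG@10 is commonly computed for a single query. It is not taken from the article: the function names, the linear gain, and the choice to build the ideal ranking from the retrieved list alone are assumptions made for illustration, and the relevance grades in the usage lines are invented.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k results (linear gain).

    relevances[i] is the graded relevance of the result at rank i + 1.
    """
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG of the given ranking divided by the DCG of the ideal ordering.

    Assumption: the ideal ordering is the same list sorted by relevance,
    rather than a ranking over every judged document for the query.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical relevance grades for one query, in ranked order.
well_ordered = ndcg_at_k([3, 2, 1, 0])
badly_ordered = ndcg_at_k([0, 1, 2, 3])
print(well_ordered, badly_ordered)
```

A system-level score is usually the mean of this value across queries, so a reported 37% improvement is a relative change in that mean between two models.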

Excerpt limited to ~120 words for fair-use compliance. The full article is at Wafer.
