LLM-as-judge variance broke our DPO training signal for 3 weeks
A DPO pipeline that used a single LLM as its preference judge saw training reward climb on every run while production accuracy fell 4 points. The judge flipped its own labels 28% of the time at temperature 0, so the preference pairs carried a misleading training signal. Switching to a three-judge consensus improved production tool-use accuracy by 2.1 points, at triple the cost.
- The DPO pipeline initially relied on a single LLM as the preference judge, and production accuracy fell by 4 points.
- The judge flipped its own labels 28% of the time, leading to unreliable training signals.
- Switching to a three-judge consensus improved production tool-use accuracy by 2.1 points, despite tripling the cost.
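
The two ideas in the points above, measuring how often a judge disagrees with itself and replacing it with a three-judge vote, can be sketched in a few lines of Python. This is a minimal illustration and not the article's implementation: the `Judge` callable type, `flip_rate`, and `consensus_label` are hypothetical names, and "flip" is assumed here to mean that repeated identical calls on the same pair return different labels (the article may define it differently, for example by swapping response order).

```python
from collections import Counter
from typing import Callable, Optional, Sequence

# Hypothetical judge interface: (prompt, response_a, response_b) -> "A" or "B".
Judge = Callable[[str, str, str], str]


def flip_rate(
    judge: Judge,
    pairs: Sequence[tuple[str, str, str]],
    repeats: int = 5,
) -> float:
    """Fraction of pairs where the judge returns more than one distinct
    label across `repeats` identical calls (assumed definition of a flip)."""
    if not pairs:
        return 0.0
    flipped = 0
    for prompt, a, b in pairs:
        labels = {judge(prompt, a, b) for _ in range(repeats)}
        if len(labels) > 1:
            flipped += 1
    return flipped / len(pairs)


def consensus_label(
    judges: Sequence[Judge], prompt: str, a: str, b: str
) -> Optional[str]:
    """Strict-majority vote over the judges; None when no label wins,
    so the pair can be dropped instead of entering DPO training."""
    votes = Counter(judge(prompt, a, b) for judge in judges)
    label, count = votes.most_common(1)[0]
    return label if count > len(judges) / 2 else None
```

With three judges, a pair is kept when at least two agree. Each pair costs three judge calls instead of one, which matches the tripled cost noted above.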
Opening excerpt (first ~120 words)
Marcus Chen
Posted on May 27

#machinelearning #llm #mlops #pytorch

TL;DR: Our DPO pipeline used a single LLM as the preference judge. Training reward climbed every run. Production accuracy fell 4 points. The judge was flipping its own labels 28% of the time at temperature 0.

The setup

Nexus Labs ships agents that book travel, file expenses, process insurance claims. Eight engineers on my fine-tuning team.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at DEV.to.