Your benchmarks are lying to you, and your judge is to blame!
A recent benchmark study found that the choice of judge significantly affects the evaluation scores of AI models. When graded by three different judges, one model's score varied by 47 percentage points depending on which judge did the grading. The findings suggest that relying on a single judge can lead to misleading conclusions about model capabilities.
- The benchmark compared six models across eleven agent skills using three different judges.
- Scores and rankings changed significantly depending on which judge graded the models.
- One model received a 4.6-point boost from its own judge compared to others.
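The points above can be made concrete with a small sketch. It assumes a score table keyed by judge and then by model, and a mapping from each judge to the model in its own family. Every name and number below is hypothetical and chosen only for illustration; none of it comes from the benchmark itself, and the article's actual method may differ.

```python
from statistics import mean

# Hypothetical scores[judge][model], in percent. Made-up data, not benchmark results.
scores = {
    "judge_a": {"model_a": 80.0, "model_b": 70.0, "model_c": 60.0},
    "judge_b": {"model_a": 70.0, "model_b": 78.0, "model_c": 62.0},
    "judge_c": {"model_a": 72.0, "model_b": 68.0, "model_c": 64.0},
}

# Hypothetical mapping: which model shares a family with each judge.
judge_family = {"judge_a": "model_a", "judge_b": "model_b", "judge_c": "model_c"}


def averaged_score(scores, model):
    """Mean of one model's score over all judges."""
    return mean(per_model[model] for per_model in scores.values())


def judge_spread(scores, model):
    """Gap between the highest and lowest score a model gets across judges."""
    values = [per_model[model] for per_model in scores.values()]
    return max(values) - min(values)


def self_preference_boost(scores, judge_family, judge):
    """Score a judge gives its own family's model, minus the mean of the other judges' scores for it."""
    model = judge_family[judge]
    own = scores[judge][model]
    others = [scores[j][model] for j in scores if j != judge]
    return own - mean(others)


def ranking(scores, judge):
    """Models ordered from best to worst according to a single judge."""
    return sorted(scores[judge], key=scores[judge].get, reverse=True)


if __name__ == "__main__":
    for judge in scores:
        boost = self_preference_boost(scores, judge_family, judge)
        print(judge, ranking(scores, judge), f"self-boost={boost:+.1f}")
    for model in judge_family.values():
        print(model, f"avg={averaged_score(scores, model):.1f}",
              f"spread={judge_spread(scores, model):.1f}")
```

In this toy data the top-ranked model changes between `judge_a` and `judge_b`, which is the kind of ranking flip a single-judge benchmark would hide. Averaging over judges dampens a self-preference effect but does not remove it, so reporting the per-judge spread next to the average is one reasonable design choice.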
Opening excerpt (first ~120 words):
Tessl. Posted on May 19.
#ai #agentskills #agents #security
Last week I published a benchmark comparing six models across eleven agent skills. The numbers in that post are averages, and we did not explain why. When I shared the data internally, Maria from our AI Research team pointed out something that we should take very seriously: an LLM judge is likely to favour outputs from its own model family.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at DEV.to.