How I use an LLM as a translation judge
The article discusses the use of GEMBA-MQM v2 to evaluate translation quality in speech-to-speech translation. It highlights how this system employs an LLM to automate the annotation process, providing structured error breakdowns similar to human reviewers. However, it also notes the variability in scores produced by LLMs and suggests running multiple passes to achieve more reliable results.
- GEMBA-MQM v2 uses MQM to evaluate translation quality by classifying errors by type and severity.
- The system automates the annotation process, yielding structured error breakdowns akin to those from human reviewers.
- LLM judges can produce inconsistent scores, prompting the recommendation to run multiple passes for accuracy.
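One way to act on the multiple-passes recommendation is to call the judge several times on the same segment and report a robust summary rather than any single score. The sketch below is only an illustration under assumptions: `judge` is a hypothetical callable standing in for the LLM call (it is not part of GEMBA-MQM), and the choice of median plus min/max spread is mine, not something the article specifies.

```python
import statistics
from typing import Callable, Dict


def judge_with_passes(
    judge: Callable[[str, str], float],
    source: str,
    translation: str,
    passes: int = 5,
) -> Dict[str, float]:
    """Run a (hypothetical) LLM judge several times and summarize the scores.

    `judge` takes (source, translation) and returns one numeric score.
    The median dampens the effect of a single outlier pass; the spread
    shows how much the judge disagreed with itself.
    """
    if passes < 1:
        raise ValueError("passes must be at least 1")

    scores = [judge(source, translation) for _ in range(passes)]
    return {
        "median": statistics.median(scores),
        "min": min(scores),
        "max": max(scores),
        "spread": max(scores) - min(scores),
    }
```

A large spread is a hint that the segment deserves a closer look, or more passes, before the score is trusted.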
Yahya Saleh • Posted on May 22 • Originally published at voicefrom.ai • #opensource #ai #llm

I use GEMBA-MQM v2 to evaluate translation quality in my live speech-to-speech translation pipeline. MQM (Multidimensional Quality Metrics) is an open industry standard for grading translations. Instead of a single score, it classifies every error by type (mistranslation, omission, hallucination, grammar, etc.) and severity (critical, major, minor).
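The type-and-severity breakdown described above maps naturally onto a small record per error plus a weighted sum. This is a minimal sketch, not the author's implementation: the `MqmError` class and `mqm_penalty` function are hypothetical names, and the severity weights (25 / 5 / 1) are an assumption chosen for illustration, so substitute whatever weighting your own rubric uses.

```python
from dataclasses import dataclass
from typing import List

# Assumed, illustrative weights; not taken from the article.
SEVERITY_WEIGHTS = {"critical": 25, "major": 5, "minor": 1}


@dataclass
class MqmError:
    """One annotated error: what kind it is, how bad, and where it occurs."""
    category: str  # e.g. "mistranslation", "omission", "hallucination", "grammar"
    severity: str  # "critical", "major", or "minor"
    span: str      # the offending text in the translation (may be empty)


def mqm_penalty(errors: List[MqmError]) -> int:
    """Sum the severity weights of all errors; 0 means no errors were found."""
    total = 0
    for error in errors:
        if error.severity not in SEVERITY_WEIGHTS:
            raise ValueError(f"unknown severity: {error.severity!r}")
        total += SEVERITY_WEIGHTS[error.severity]
    return total
```

Keeping the individual errors around, rather than only the total, is what preserves the structured breakdown that a single score would hide.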
…
The full article is at DEV.to.