I A/B tested 4 LLMs on the same 500 queries. The results surprised me.
This A/B test compared four large language models (LLMs) on the same 500 real queries. No single model excelled at every task; performance varied with the type of query. For production systems, this suggests routing each task to the model best suited to it.
- The fastest model was DeepSeek-V4 Pro, averaging 1.8 seconds per query.
- Qwen3 235B was the most accurate overall, with a score of 4.3 out of 5.
- No single model won more than 45% of the task categories, indicating that the best model depends on the specific task.
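The routing idea in the summary can be sketched in a few lines. This is a minimal sketch, not the author's method: it assumes each test result is a `(category, model, score)` tuple, and the function names, the category labels, and the `model_a`/`model_b` placeholders are all hypothetical. The scores in the usage lines are invented for illustration and are not results from the test.

```python
from collections import defaultdict
from statistics import mean


def best_model_per_category(results):
    """Map each task category to the model with the highest mean score.

    `results` is an iterable of (category, model, score) tuples.
    This tuple layout is an assumption, not the article's data format.
    """
    scores = defaultdict(lambda: defaultdict(list))
    for category, model, score in results:
        scores[category][model].append(score)
    return {
        category: max(by_model, key=lambda m: mean(by_model[m]))
        for category, by_model in scores.items()
    }


def win_share(routing):
    """Return the fraction of categories that each model wins."""
    counts = defaultdict(int)
    for model in routing.values():
        counts[model] += 1
    total = len(routing)
    return {model: n / total for model, n in counts.items()}


def route(category, routing, default):
    """Pick the model for a category, falling back to `default` if unseen."""
    return routing.get(category, default)


# Placeholder data: invented scores for two hypothetical models.
sample = [
    ("coding", "model_a", 4),
    ("coding", "model_b", 5),
    ("summarization", "model_a", 5),
    ("summarization", "model_b", 3),
]
routing = best_model_per_category(sample)
shares = win_share(routing)
chosen = route("translation", routing, default="model_a")
```

A lookup table with a default keeps the routing decision cheap and easy to audit. The `win_share` helper gives the kind of figure the last bullet reports: the share of categories won by each model.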
Opening excerpt (first ~120 words)
NovaStack. Posted on May 25.
I see a lot of claims about which model is "best." Best at what? For whom? At what cost? I got tired of guessing. So I ran my own comparison.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at DEV.to.