I A/B tested 4 LLMs on the same 500 queries. The results surprised me.

May 25, 2026 · 12:47 AM UTC · 3 min read

⚡ TL;DR · AI summary

An A/B test compared four large language models (LLMs) on the same 500 real queries. No single model excelled at every task; performance varied with the type of query. For production systems, this suggests routing each task to the model best suited to it.

Key facts

▪ The fastest model was DeepSeek-V4 Pro, averaging 1.8 seconds per query.
▪ Qwen3 235B was the most accurate overall, with a score of 4.3 out of 5.
▪ No single model won more than 45% of the task categories, indicating that the best model depends on the specific task.
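
The routing idea in the summary can be sketched in a few lines. This is a minimal illustration, not the author's setup: the category names, the model names (`model-a`, `model-b`), and every score below are hypothetical placeholders, and the helper functions are invented for this example. It picks the top-scoring model per category, reports each model's share of category wins, and routes a query by its category.

```python
from collections import Counter

# Hypothetical per-category mean scores on a 1-5 scale.
# Placeholder values for illustration only, NOT the article's data.
SCORES = {
    "coding": {"model-a": 4.5, "model-b": 4.1},
    "summarization": {"model-a": 3.9, "model-b": 4.4},
    "extraction": {"model-a": 4.0, "model-b": 4.2},
}


def best_model_per_category(scores):
    """Return {category: model with the highest mean score}."""
    return {cat: max(by_model, key=by_model.get) for cat, by_model in scores.items()}


def win_share(routing_table):
    """Return the fraction of categories that each winning model takes."""
    counts = Counter(routing_table.values())
    total = len(routing_table)
    return {model: n / total for model, n in counts.items()}


def route(category, routing_table, default):
    """Pick the model for a query category, falling back to a default."""
    return routing_table.get(category, default)


ROUTING_TABLE = best_model_per_category(SCORES)
```

Calling `route("coding", ROUTING_TABLE, default="model-a")` returns the category winner, and an unseen category falls back to the default. In practice the scores would come from an evaluation run like the one the article describes, and a query classifier (not shown) would supply the category.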

Original article

DEV.to (Top)


Opening excerpt (first ~120 words)

NovaStack · Posted on May 25

I see a lot of claims about which model is "best." Best at what? For whom? At what cost? I got tired of guessing. So I ran my own comparison.

…

Excerpt limited to ~120 words for fair-use compliance. The full article is at DEV.to (Top).
