OpenAI Operator scores 43% on hard web tasks. We scored 81%. Here are all 300 runs.
TinyFish benchmarked its web agent against OpenAI's Operator. TinyFish scored 81% on complex web tasks, while Operator scored 43%. The evaluation covered 300 tasks across a range of websites and highlights how difficult real-world applications remain for web agents.
- TinyFish tested its web agent against public benchmarks to evaluate its performance.
- The benchmark included 300 tasks of varying difficulty, with human evaluation.
- TinyFish's agent succeeded on 81% of the tasks, while OpenAI's Operator succeeded on 43%.
Opening excerpt (first ~120 words)
Tinyfishie, posted on May 19. Originally published at tinyfish.ai.
#agents #ai #automation #showdev
TinyFish set out to build web agents that solve real world problems for DoorDash, Google Hotels, ClassPass, and all the smaller businesses trying to keep up with the giants. That's what we do every day at production scale. But we also wanted to test ourselves against the public benchmarks.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at DEV.to.