WeSearch

OpenAI Operator scores 43% on hard web tasks. We scored 81%. Here are all 300 runs.

#ai #automation #web agents
⚡ TL;DR · AI summary

TinyFish benchmarked its web agent against OpenAI's Operator. TinyFish scored 81% on hard web tasks, while Operator scored 43%. The evaluation involved 300 tasks across various websites, highlighting the challenges of real-world applications for web agents.

Original article
DEV.to (Top)
Opening excerpt (first ~120 words)

Tinyfishie • Posted on May 19 • Originally published at tinyfish.ai

#agents #ai #automation #showdev

TinyFish set out to build web agents that solve real world problems for DoorDash, Google Hotels, ClassPass, and all the smaller businesses trying to keep up with the giants. That's what we do every day at production scale. But we also wanted to test ourselves against the public benchmarks.

Excerpt limited to ~120 words for fair-use compliance. The full article is at DEV.to (Top).
