The first benchmark to test AI agents' video editing capabilities
A recent benchmark tested the video editing capabilities of AI agents against human experts. The best-performing agent scored just over 30%, while human experts scored an average of 89%. The result highlights the wide gap between AI agents and human experts on post-production tasks.
- The benchmark involved 100 expert-authored tasks across four stages of post-production.
- Human experts scored an average of 89%, while the best AI agent scored only 30%.
- The study emphasizes that both the AI model and the supporting framework, or harness, influence performance.
Opening excerpt (first ~120 words)
May 2026
Can AI agents do real-world post-production work?
We gave the 7 best frontier models 100 expert-authored tasks across the four stages of post-production. The best agent barely crosses 30%. Human experts scored 89%.
100 tasks, 20 industry experts, 7 frontier models, 4 task families
Why this benchmark exists
Verification is not here for free. RLVR works in math and code because centuries of humanistic work built the verifiers, the bill was paid before we got there. Creative work hasn't paid that bill.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at AgenticVBench.