BenchBench
BenchBench is a new benchmark that evaluates how well AI models can create benchmarks for each other. GPT 5.2 was the only model that succeeded at the task; the others struggled to produce effective benchmarks. The results highlight the difference between a model's ability as a benchmark creator and its ability as a benchmark solver.
- BenchBench evaluates AI models on their ability to create effective benchmarks.
- GPT 5.2 was the only model that successfully created a useful benchmark.
- Other models, including GPT 5.5 and Opus 4.6, struggled to produce challenging benchmarks.
Opening excerpt (first ~120 words)
Introducing BenchBench
Rohit Krishnan
May 25, 2026
TL;DR: presenting the ultimate benchmark, getting models to create benchmarks for each other, and GPT 5.2 is the current (only) winner.
Models are getting much much better at almost every benchmark we’ve thrown at them. Creating benchmarks is now a job relegated to the smartest and best of us. Even the newest and best ones seem to get saturated in record time. What this means is that increasingly the hardest job is to create a good enough AI benchmark.
So I took the obvious next step. Created a benchmark to see how well the models can create a benchmark.
…
The full article is at Hacker News (Newest).