InferenceBench: A Benchmark for Open-Ended Inference Optimization by AI Agents
InferenceBench is a benchmark that evaluates how well AI agents can optimize large language model (LLM) serving workloads. The study found that agents generally outperform the vanilla PyTorch baseline, but fall short of simple hyperparameter searches over existing engine settings given the same time budget. The benchmark emphasizes the importance of consistent performance and valid submissions in automated research and development.
- InferenceBench assesses AI agents' ability to optimize LLM serving under a fixed compute budget.
- Agents outperformed the vanilla PyTorch baseline but were outperformed by hyperparameter searches.
- The benchmark includes four distinct scenarios targeting different serving bottlenecks.
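
To make the idea of a simple hyperparameter search under a fixed time budget concrete, here is a minimal sketch. It is not the benchmark's actual search procedure, which this summary does not describe. The setting names (`max_batch_size`, `enable_prefix_cache`), the toy scoring function, and the simulated per-trial cost are all hypothetical and do not correspond to any real engine's flags.

```python
import itertools


def grid_search(space, evaluate, budget_s):
    """Try configs in grid order until the (simulated) time budget is used up."""
    keys = list(space)
    best_cfg, best_score, spent = None, float("-inf"), 0.0
    for combo in itertools.product(*(space[k] for k in keys)):
        if spent >= budget_s:
            break  # no budget left to start another trial
        cfg = dict(zip(keys, combo))
        score, cost_s = evaluate(cfg)
        spent += cost_s
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score, spent


def toy_evaluate(cfg):
    """Hypothetical stand-in for a real serving benchmark run.

    Returns (throughput score, seconds the trial cost). Both are made up.
    """
    score = cfg["max_batch_size"] * (1.5 if cfg["enable_prefix_cache"] else 1.0)
    return score, 10.0


# Hypothetical search space; the names are illustrative only.
space = {
    "max_batch_size": [8, 16, 32],
    "enable_prefix_cache": [False, True],
}

best_cfg, best_score, spent = grid_search(space, toy_evaluate, budget_s=40.0)
print(best_cfg, best_score, spent)
```

With a 40-second simulated budget and 10 seconds per trial, only four of the six configurations are tried, so the search returns the best of those four rather than the global optimum. A real search would replace `toy_evaluate` with an actual benchmark run and measure elapsed wall-clock time.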
Opening excerpt (first ~120 words)
Benchmarking Open-Ended Inference Optimization by AI Agents

InferenceBench evaluates whether frontier coding agents can optimize LLM serving workloads under a fixed compute budget. The main bottleneck is not knowing relevant techniques, but consistently running, comparing, and preserving the right experiments.

Main Results

Across all four scenarios, agents outperform the vanilla PyTorch baseline and most inference engines with default configs (e.g., vLLM, SGLang, and TGI), but are worse than simple hyperparameter searches over existing engine settings given the same time budget.

[Figure: Aggregate performance, geometric mean speedup, for agents and search / baselines. Bars show geometric-mean speedup; whiskers show ±SEM over seed-pair runs.]
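
As a rough illustration of the two statistics named in the figure caption, the sketch below computes a geometric-mean speedup per run and the standard error of the mean (SEM) across runs. The speedup values are made up, and the aggregation order (geometric mean across scenarios first, then SEM across runs) is an assumption; the excerpt does not say how the authors combine scenarios and seed-pair runs.

```python
import math
from statistics import stdev


def geometric_mean(values):
    """Geometric mean via the mean of logs; needs strictly positive values."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be non-empty and strictly positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def sem(values):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    if len(values) < 2:
        raise ValueError("need at least two runs to estimate SEM")
    return stdev(values) / math.sqrt(len(values))


# Hypothetical speedups over a baseline: one list per run, four scenarios each.
runs = [
    [2.0, 8.0, 4.0, 4.0],
    [1.0, 4.0, 2.0, 2.0],
    [4.0, 4.0, 4.0, 4.0],
]

per_run = [geometric_mean(r) for r in runs]  # one aggregate speedup per run
bar_height = sum(per_run) / len(per_run)     # assumed: mean across runs
whisker = sem(per_run)                       # assumed: SEM across runs

print(f"speedup = {bar_height:.2f} ± {whisker:.2f}")
```

The geometric mean is the usual choice for averaging ratios such as speedups, because a 2x gain in one scenario and a 0.5x loss in another cancel out to 1x, which an arithmetic mean would not reflect.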
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at Inferencebench.