FastKernels: Benchmarking GPU Kernel Generation in Production
FastKernels is a new benchmark for GPU kernel generation that addresses the misalignment between existing benchmarks and production environments. It includes a minimal set of architectures that covers the vast majority of HuggingFace Transformers. Evaluations show that current kernel agents struggle to achieve significant speedups over production baselines, which highlights the need for benchmarks that better reflect production settings.
- FastKernels is designed to improve GPU kernel generation by aligning benchmarks with production inference frameworks.
- The benchmark includes 46 representative architectures that cover 96.2% of HuggingFace Transformers.
- Current state-of-the-art kernel agents achieve only modest speedups over production baselines, indicating a critical bottleneck.
Opening excerpt (first ~120 words)
arXiv:2605.23215 (cs), Computer Science > Machine Learning. Submitted on 22 May 2026.
Title: FastKernels: Benchmarking GPU Kernel Generation in Production
Authors: Gabriele Oliaro, Yichao Fu, May Jiang, Owen Lu, Junli Wang, Zhihao Jia, Hao Zhang, Samyam Rajbhandari
Abstract: LLM-based agents for GPU kernel generation are advancing rapidly, yet their progress is fundamentally constrained by the benchmarks they optimize against.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.