The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection
The paper examines benchmark auditing for large language models, focusing on contamination detection. It identifies a reliability gap between controlled validation and practical auditing scenarios, traces that gap to two failure modes (distribution shift and scale constraints), and concludes that current statistical methods are insufficient for reliable auditing.
- Benchmark contamination threatens the validity of large language model assessments.
- The study evaluates three paradigms across 27 models and finds a significant reliability gap.
- Only 199 out of 335 evaluations yielded correct outcomes, indicating issues with existing statistical detection methods.
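
To put the last figure in proportion, the counts quoted above can be turned into a rate. This is only a minimal arithmetic sketch: the helper name `correct_rate` is hypothetical, and the summary does not say how the paper defines a "correct outcome", so the code takes the two counts at face value.

```python
def correct_rate(correct: int, total: int) -> float:
    """Return the fraction of evaluations that yielded a correct outcome."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must lie between 0 and total")
    return correct / total


# Counts quoted in the summary; everything else here is illustrative.
CORRECT, TOTAL = 199, 335

rate = correct_rate(CORRECT, TOTAL)
print(f"correct: {rate:.1%}, incorrect: {1 - rate:.1%}")
# prints "correct: 59.4%, incorrect: 40.6%"
```

In other words, 136 of the 335 evaluations, roughly two in five, did not yield a correct outcome.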
Opening excerpt (first ~120 words)
arXiv:2606.03305 (cs) [Submitted on 2 Jun 2026]
Title: The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection
Authors: Wojciech Zarzecki, Jan Dubiński, Sebastian Cygert
Abstract: Benchmark contamination, where evaluation examples appear in a model's training data, threatens the validity of LLM assessment.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.