Confidence Calibration in Large Language Models
A recent study examines the confidence calibration of large language models (LLMs) across various tasks. The findings indicate that LLMs tend to be overly confident, with their confidence levels exceeding their accuracy on average. Additionally, the study introduces LifeEval, a tool designed to assess model calibration based on task difficulty.
- The study reveals that LLMs exhibit overconfidence, particularly on challenging tasks.
- Conversely, LLMs demonstrate significant underconfidence on easier tasks.
- LifeEval is developed as a method for evaluating the calibration of models across different levels of difficulty.
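One common way to make this pattern concrete is to measure the calibration gap: mean stated confidence minus accuracy, computed separately for each difficulty level. A positive gap indicates overconfidence and a negative gap indicates underconfidence. The following is a minimal Python sketch under that assumption. The record fields (`difficulty`, `confidence`, `correct`), the function names, and the sample values are hypothetical illustrations; they are not LifeEval's interface and not data from the study.

```python
from collections import defaultdict


def calibration_gap(records):
    """Mean confidence minus accuracy; positive means overconfident."""
    if not records:
        raise ValueError("records must not be empty")
    mean_confidence = sum(r["confidence"] for r in records) / len(records)
    accuracy = sum(1 for r in records if r["correct"]) / len(records)
    return mean_confidence - accuracy


def gap_by_difficulty(records):
    """Group records by their difficulty label and compute the gap per group."""
    groups = defaultdict(list)
    for r in records:
        groups[r["difficulty"]].append(r)
    return {level: calibration_gap(rs) for level, rs in groups.items()}


# Made-up records for illustration only (hypothetical, not study results).
sample = [
    {"difficulty": "easy", "confidence": 0.80, "correct": True},
    {"difficulty": "easy", "confidence": 0.70, "correct": True},
    {"difficulty": "hard", "confidence": 0.90, "correct": False},
    {"difficulty": "hard", "confidence": 0.80, "correct": True},
]

overall_gap = calibration_gap(sample)
gaps = gap_by_difficulty(sample)
for level, gap in gaps.items():
    label = "overconfident" if gap > 0 else "underconfident or calibrated"
    print(f"{level}: gap = {gap:+.2f} ({label})")
```

In this toy data the easy items have a negative gap and the hard items a positive one, which is the shape of the pattern described in the bullets above. The overall gap can still be positive, since averaging across difficulty levels hides the opposite signs.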
Opening excerpt (first ~120 words)
arXiv:2605.23909 (cs) [Submitted on 3 Apr 2026]
Title: Confidence Calibration in Large Language Models
Authors: Noam Michael, Daniel BenShushan, Jacob Bien, Don A. Moore
Abstract: We investigate the calibration of large language models' (LLMs') confidence across diverse tasks. The results of our preregistered study show that the current crop of LLMs are, like people, too sure they are right: confidence exceeds accuracy, on average.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.