Capturing LLM Capabilities via Evidence-Calibrated Query Clustering
The paper presents ECC, an algorithm that clusters queries by their latent capability demands. It targets the misalignment between surface-level semantics and actual model performance, with the goal of improving capability-aware evaluation of large language models (LLMs). In the reported evaluations, ECC improves capability ranking quality over both human-labeled and embedding-based baselines.
- ECC calibrates prior semantic embeddings using limited posterior model comparisons.
- The algorithm characterizes each cluster through a capability profile parameterized by a Bradley-Terry model.
- ECC outperforms human-labeled and embedding-based baselines by an average of 17.64 and 18.02 percentage points, respectively.
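To make the capability-profile idea in the list above concrete, here is a minimal sketch of fitting one Bradley-Terry model per cluster from pairwise model comparisons. It is not the paper's implementation. The function names, the `(cluster_id, winner, loser)` record format, the model names, and the toy data are all assumptions made for illustration, and the embedding calibration step of ECC is not shown. Under Bradley-Terry, model i beats model j with probability p_i / (p_i + p_j); the strengths p are fitted here with the standard minorization-maximization update.

```python
from collections import defaultdict


def fit_bradley_terry(comparisons, n_iters=200, tol=1e-9):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Returns a dict mapping each model name to a strength; strengths sum to 1.
    Hypothetical helper for illustration, not an API from the paper.
    """
    if not comparisons:
        raise ValueError("need at least one comparison")

    wins = defaultdict(float)         # total wins per model
    pair_counts = defaultdict(float)  # number of comparisons per unordered pair
    models = set()
    for winner, loser in comparisons:
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0
        models.update((winner, loser))

    models = sorted(models)
    strength = {m: 1.0 / len(models) for m in models}

    for _ in range(n_iters):
        updated = {}
        for i in models:
            # MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j)
            denom = 0.0
            for j in models:
                if i == j:
                    continue
                n_ij = pair_counts.get(frozenset((i, j)), 0.0)
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            updated[i] = wins[i] / denom if denom > 0 else strength[i]

        # Normalize so the strengths are identifiable (sum to 1).
        total = sum(updated.values())
        updated = {m: s / total for m, s in updated.items()}

        delta = max(abs(updated[m] - strength[m]) for m in models)
        strength = updated
        if delta < tol:
            break
    return strength


def cluster_capability_profiles(records):
    """Group (cluster_id, winner, loser) records and fit one model per cluster."""
    by_cluster = defaultdict(list)
    for cluster_id, winner, loser in records:
        by_cluster[cluster_id].append((winner, loser))
    return {c: fit_bradley_terry(pairs) for c, pairs in by_cluster.items()}


# Toy data (made up): model_a wins 3 of 4 "math" comparisons,
# model_b wins 3 of 4 "code" comparisons.
records = (
    [("math", "model_a", "model_b")] * 3
    + [("math", "model_b", "model_a")]
    + [("code", "model_b", "model_a")] * 3
    + [("code", "model_a", "model_b")]
)
profiles = cluster_capability_profiles(records)
for cluster_id, profile in sorted(profiles.items()):
    print(cluster_id, {m: round(s, 3) for m, s in profile.items()})
```

In this toy data the ranking of the two models flips between clusters, which is the kind of per-cluster difference a capability profile is meant to expose. A model with no wins in a cluster is driven to strength 0 by this update; a practical implementation would likely add regularization.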
Opening excerpt (first ~120 words)
arXiv:2605.17110 (cs), Computer Science > Artificial Intelligence. Submitted on 16 May 2026.
Title: Capturing LLM Capabilities via Evidence-Calibrated Query Clustering
Authors: Fangzhou Wu, Sandeep Silwal, Qiuyi Zhang
Abstract: Query clustering organizes queries into groups that reflect shared latent capability demands, enabling capability-aware LLM evaluation. Existing clustering methods, which primarily rely on semantic taxonomies or embeddings, often fail to capture such latent capability requirements due to a misalignment between surface-level semantics and actual model performance.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.