Can We Trust AI-Inferred User States. A Psychometric Framework for Validating the Reliability of Users States Classification by LLMs in Operational Environments
The paper investigates how reliably large language models classify user states in operational environments. It finds that many metrics are unstable at the level of individual scores, which complicates their use in real-time adaptive systems. The study proposes a framework for evaluating where each metric can be applied and argues that such validation should be part of AI system design.
- The study tested the psychometric reliability of AI measures across three large language models.
- Only 31 of 213 metrics met the reliability criteria for individual scores.
- Individually unstable metrics can still be useful for post-hoc analysis of user interactions.
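To make the idea of score-level reliability concrete, here is a minimal sketch of one common psychometric check: a one-way intraclass correlation (ICC) over repeated LLM scores of the same conversations. The summary does not say which reliability coefficients or cut-offs the authors used, so the choice of ICC(1,1), the function names, the sample data, and the 0.75 threshold are all illustrative assumptions rather than the paper's method.

```python
from statistics import mean

def icc_oneway(scores):
    """One-way random-effects ICC(1,1).

    scores: list of rows, one row per conversation; each row holds the
    k scores an LLM assigned to the same metric on k repeated runs.
    (Hypothetical data layout, not taken from the paper.)
    """
    n = len(scores)          # number of conversations
    k = len(scores[0])       # repeated runs per conversation
    grand = mean(v for row in scores for v in row)
    row_means = [mean(row) for row in scores]
    # Mean square between conversations
    msb = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    # Mean square within conversations (run-to-run noise)
    msw = sum((v - m) ** 2
              for row, m in zip(scores, row_means)
              for v in row) / (n * (k - 1))
    # Undefined (division by zero) if every score is identical
    return (msb - msw) / (msb + (k - 1) * msw)

def usable_for_individual_scores(scores, threshold=0.75):
    # threshold is an assumed cut-off, not the paper's criterion
    return icc_oneway(scores) >= threshold

# Made-up examples: perfectly repeatable vs. pure run-to-run noise
stable = [[1, 1, 1], [3, 3, 3], [5, 5, 5]]
noisy = [[1, 5, 3], [3, 1, 5], [5, 3, 1]]

# Share of metrics that passed, from the figures reported above
share_passing = 31 / 213
```

Under this sketch, a metric like `noisy` would fail the individual-score check even though averages over many conversations could still be informative, which matches the point that individually unstable metrics may remain useful for post-hoc analysis.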
Opening excerpt (first ~120 words)
arXiv:2605.15734 (cs), Computer Science > Artificial Intelligence
Submitted on 15 May 2026
Authors: Izabella Krzeminska, Michal Butkiewicz, Ewa Komkowska
Abstract: The use of large language models to assess user states in conversational and adaptive systems is based on the assumption that the metrics used for such assessment are stable and interpretable at the level of individual…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.