RealUserSim: Bridging the Reality Gap in Agent Benchmarking via Grounded User Simulation

May 22, 2026 · 4:00 AM UTC · 3 min read

#human-computer interaction #artificial intelligence #user simulation

⚡ TL;DR · AI summary

The paper introduces RealUserSim, a user simulation framework that aims to improve agent benchmarking by grounding simulated users in real behavioral data. It argues that current LLM-based simulated users are poor proxies for real humans: unconstrained LLM defaults match real users' style only 6-8% of the time. Drawing on more than 14,000 authentic conversations, the framework raises match rates from 24.2% to 45.3% across five behavioral dimensions.

Key facts

▪ RealUserSim is the first user simulation framework grounded in real behavioral data.
▪ The framework improves match rates from 24.2% to 45.3% across five behavioral dimensions.
▪ Grounded simulation reveals failure mechanisms that are not visible in cooperative simulators.
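
To make the match-rate figures above concrete, here is a minimal sketch in Python. It rests on assumptions: the excerpt does not define how the paper computes a match rate, so the code simply treats it as the fraction of (conversation, dimension) pairs where the simulated user's label agrees with the real user's label. The function names and the dimension names (`style`, `length`) are hypothetical and are not taken from the paper. The second helper only restates the reported 24.2% and 45.3% as an absolute and a relative gain.

```python
from typing import Dict, List, Tuple

# One conversation's labels, keyed by behavioral dimension.
# Dimension names are hypothetical; the summary only says there are five.
Labels = Dict[str, str]


def match_rate(simulated: List[Labels], real: List[Labels]) -> float:
    """Fraction of (conversation, dimension) pairs whose labels agree.

    This is an assumed definition for illustration, not the paper's metric.
    """
    if len(simulated) != len(real):
        raise ValueError("simulated and real must have the same length")
    total = 0
    matches = 0
    for sim, ref in zip(simulated, real):
        for dim, ref_label in ref.items():
            total += 1
            if sim.get(dim) == ref_label:
                matches += 1
    return matches / total if total else 0.0


def improvement(before_pct: float, after_pct: float) -> Tuple[float, float]:
    """Return (gain in percentage points, ratio of after to before)."""
    return after_pct - before_pct, after_pct / before_pct


# Toy data: one conversation, two hypothetical dimensions, one agreement.
toy_rate = match_rate(
    [{"style": "casual", "length": "short"}],
    [{"style": "casual", "length": "long"}],
)

# The figures reported in the summary above.
gain_pp, ratio = improvement(24.2, 45.3)
print(f"toy match rate: {toy_rate:.2f}")
print(f"gain: {gain_pp:.1f} percentage points, ratio: {ratio:.2f}x")
```

Read this way, the reported change is a gain of about 21.1 percentage points, or roughly 1.87 times the baseline match rate.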

Original article

arXiv cs.AI

Opening excerpt (first ~120 words)

Computer Science > Human-Computer Interaction
arXiv:2605.20204 (cs) [Submitted on 7 Apr 2026]

Title: RealUserSim: Bridging the Reality Gap in Agent Benchmarking via Grounded User Simulation

Authors: Ming Zhu, Juntao Tan, Rithesh Murthy, Jielin Qiu, Liangwei Yang, Wenting Zhao, Silvio Savarese, Shelby Heinecke, Huan Wang

Abstract: LLM-based user simulation is the primary mechanism for end-to-end agent evaluation, yet simulated users are poor proxies for real humans: unconstrained LLM defaults produce a Formalism Ceiling (style match rates of 6-8% against real users), while hand-crafted behavioral directives…

Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.
