What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents
The paper introduces SERL, a framework for improving reinforcement learning in multi-turn agents. It uses environment feedback selectively to improve learning efficiency and success rates. According to the reported results, SERL significantly outperforms existing methods on the ALFWorld and WebShop task environments.
- SERL stands for selective environment-reweighted learning framework.
- The framework uses task rewards to guide update directions while adjusting based on environmental feedback.
- SERL achieved success rates of 90.0% and 80.1% on ALFWorld and WebShop, respectively.
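The second point above can be pictured with a small sketch. This is not the paper's algorithm: the summary does not say how SERL scores or selects environment feedback, so the function name `reweighted_step_signals`, the non-negative per-step `feedback_scores`, and the normalization to a mean weight of 1 are all assumptions made for illustration. The only idea taken from the summary is that the task reward sets the direction (sign) of the update, while environment feedback adjusts how much each step contributes.

```python
from typing import List


def reweighted_step_signals(task_reward: float,
                            feedback_scores: List[float]) -> List[float]:
    """Hypothetical illustration, not SERL itself.

    The trajectory-level task reward fixes the sign of every step's
    learning signal; assumed non-negative per-step feedback scores
    only rescale the magnitude of each step.
    """
    if not feedback_scores:
        return []
    if any(score < 0 for score in feedback_scores):
        raise ValueError("feedback scores are assumed to be non-negative")

    n_steps = len(feedback_scores)
    total = sum(feedback_scores)
    if total == 0:
        # No informative feedback: fall back to the plain trajectory-level
        # signal, where every step receives the same weight.
        weights = [1.0] * n_steps
    else:
        # Normalize so the average weight is 1, keeping the overall scale
        # comparable to the uniform trajectory-level baseline.
        weights = [n_steps * score / total for score in feedback_scores]

    return [task_reward * weight for weight in weights]


if __name__ == "__main__":
    # A successful three-step episode where the last step had the most
    # informative feedback (scores are made-up numbers).
    print(reweighted_step_signals(1.0, [0.0, 1.0, 3.0]))
```

With uniform weights this reduces to broadcasting one success-or-failure signal to every action, which is the credit-assignment baseline the abstract describes. A failed episode (negative `task_reward`) flips the sign of every step's signal but keeps the same relative weighting.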
Opening excerpt (first ~120 words)
arXiv:2605.19447 (cs), Computer Science > Artificial Intelligence. Submitted on 19 May 2026.
Title: What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents
Authors: Xiaozhe Li, Tianyi Lyu, Yang Li, Yichuan Ma, Peiji Li, Linyang Li, Qipeng Guo, Dahua Lin, Kai Chen
Abstract: Reinforcement learning can train LLM agents from sparse task rewards, but long-horizon credit assignment remains challenging: a single success-or-failure signal must be distributed across many actions. Existing methods rely on trajectory-level rewards or proxy signals, without fully leveraging per-step environmental feedback.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.