Minimax Optimal Variance-Aware Regret Bounds for Multinomial Logistic MDPs

May 20, 2026 · 4:00 AM UTC ·3 min read · 0 reactions · 0 comments · 15 views

#artificial intelligence #machine learning #statistics

⚡ TL;DR · AI summary

The article discusses a new algorithm for reinforcement learning in multinomial logistic Markov Decision Processes (MDPs). This algorithm achieves improved regret bounds compared to existing methods, particularly for structured MDPs. The authors establish both upper and lower bounds, demonstrating the minimax optimality of their approach.

Key facts

▪The proposed algorithm achieves a regret of O(dH^2barσT√T).
▪For KL-constrained robust MDPs, the algorithm reduces the horizon dependence by a factor of H.
▪This work fully characterizes the regret complexity of multinomial logistic mixture MDPs for the first time.

Original article

arXiv cs.AI

Read full at arXiv cs.AI →

Opening excerpt (first ~120 words) tap to expand

Computer Science > Artificial Intelligence arXiv:2605.19768 (cs) [Submitted on 19 May 2026] Title:Minimax Optimal Variance-Aware Regret Bounds for Multinomial Logistic MDPs Authors:Pierre Boudart (SIERRA), Pierre Gaillard (Thoth), Alessandro Rudi (PSL, DI-ENS, Inria) View a PDF of the paper titled Minimax Optimal Variance-Aware Regret Bounds for Multinomial Logistic MDPs, by Pierre Boudart (SIERRA) and 4 other authors View PDF Abstract:We study reinforcement learning for episodic Markov Decision Processes (MDPs) whose transitions are modelled by a multinomial logistic (MNL) model. Existing algorithms for MNL mixture MDPs yield a regret of $\smash{\tilde{O}(dH^2\sqrt{T})}$ (Li et al., 2024), where $d$ is the feature dimension, $H$ the episode length, and $T$ the number of episodes.

…

Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.

Anonymous · no account needed

Discussion

0 comments

Minimax Optimal Variance-Aware Regret Bounds for Multinomial Logistic MDPs

Discussion

More from arXiv cs.AI