How Well Do Models Follow Their Constitutions?
The paper examines how well AI models adhere to their specified behavioral guidelines. It introduces a multi-method audit pipeline to evaluate compliance under real-world conditions. Findings indicate that newer model generations show significant improvements in following their respective specifications.
- The study analyzes models from Anthropic and OpenAI against their published specifications.
- Violation rates fell between model generations: Claude models dropped from 15.0% to 2.0%, and GPT models from 11.7% to 3.6%.
- The research highlights remaining issues related to AI identity questioning and the generation of misleading quantitative claims.
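To put the two reported drops on a common footing, the absolute and relative reductions can be computed directly from the four percentages quoted above. The following is a minimal Python sketch. The function name `reduction` and the `rates` dictionary are illustrative assumptions and are not part of the paper's audit pipeline.

```python
def reduction(before: float, after: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = before - after
    relative = absolute / before * 100.0
    return round(absolute, 1), round(relative, 1)


# Violation rates (percent) as quoted in the summary; the layout is hypothetical.
rates = {"Claude": (15.0, 2.0), "GPT": (11.7, 3.6)}

for family, (before, after) in rates.items():
    absolute, relative = reduction(before, after)
    print(f"{family}: {absolute} points absolute, {relative}% relative")
```

On these figures the relative reduction is roughly 87% for the Claude models and 69% for the GPT models, so the Claude drop is larger in both absolute and relative terms.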
Opening excerpt (first ~120 words)
arXiv:2605.24229 (cs), Computer Science > Artificial Intelligence
Submitted on 22 May 2026
Title: How Well Do Models Follow Their Constitutions?
Authors: Arya Jakkli, Senthooran Rajamanoharan, Neel Nanda
Abstract: Frontier AI developers now train models against long written behavioral specifications, such as Anthropic's constitution (Anthropic, 2025a) and OpenAI's Model Spec (OpenAI, 2025a), integrated into post-training via methods like character training (Anthropic, 2024) and deliberative alignment (Guan et al., 2024).
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.