Fair outputs, Biased Internals: Causal Potency and Asymmetry of Latent Bias in LLMs for High-Stakes Decisions
A recent study investigates latent bias in instruction-tuned language models used for high-stakes decisions. Although these models produce fair outputs, they retain biased internal representations that can significantly influence their decisions. The research highlights the need for dual-layer testing frameworks in AI governance to address these internal biases.
- Instruction-tuned language models show behavioral fairness in high-stakes decisions but retain biased associations internally.
- The study reveals that suppressed internal representations can affect model outputs, leading to decision reversals when reintroduced.
- Latent bias in these models is asymmetric, impacting decisions in one demographic direction more than the other.
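To make the asymmetry claim concrete, the sketch below shows one way such an effect could be quantified: compare how often a decision reverses when an intervention pushes toward one demographic direction versus the other. This is an illustration only, not the authors' method. The `Trial` record, the direction labels `toward_A` and `toward_B`, and the toy data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One decision, before and after a (hypothetical) intervention."""
    direction: str   # assumed label: "toward_A" or "toward_B"
    baseline: str    # decision without the intervention
    intervened: str  # decision with the intervention applied


def flip_rate(trials, direction):
    """Share of trials in the given direction whose decision reversed."""
    subset = [t for t in trials if t.direction == direction]
    if not subset:
        return 0.0
    flips = sum(t.baseline != t.intervened for t in subset)
    return flips / len(subset)


def asymmetry(trials, dir_a="toward_A", dir_b="toward_B"):
    """Difference in flip rates; 0.0 would indicate a symmetric effect."""
    return flip_rate(trials, dir_a) - flip_rate(trials, dir_b)


# Toy data, invented purely for illustration.
toy_trials = [
    Trial("toward_A", "approve", "reject"),
    Trial("toward_A", "approve", "reject"),
    Trial("toward_A", "approve", "reject"),
    Trial("toward_A", "approve", "approve"),
    Trial("toward_B", "reject", "approve"),
    Trial("toward_B", "reject", "reject"),
    Trial("toward_B", "reject", "reject"),
    Trial("toward_B", "reject", "reject"),
]

print(flip_rate(toy_trials, "toward_A"))
print(flip_rate(toy_trials, "toward_B"))
print(asymmetry(toy_trials))
```

A value far from zero in `asymmetry` would correspond to the pattern described above, where decisions move more readily in one demographic direction than the other.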
arXiv:2605.15217 (cs), submitted on 12 May 2026
Authors: Jagdish Tripathy, Marcus Buckmann
Abstract: Instruction-tuned language models exhibit behavioural fairness in high-stakes decisions while retaining biased associations in their internal representations. However, whether these suppressed representations can affect model outputs - and whether such causal potency is symmetric across demographic groups - remains unknown.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.