Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks
The paper examines positional failures in long-context large language models (LLMs) and argues that reasoning benchmarks overlook them. Because current benchmarks do not jointly control task position, filler content, and context length, they can miss the significant performance drops that certain models show when the target task's position changes. The authors propose an evaluation framework, Context Rot Evaluation (CRE), to address this gap and to demonstrate the vulnerabilities of existing models.
- Mainstream reasoning benchmarks do not control the positional placement of target tasks in long contexts.
- An audit of 11 long-context benchmarks finds that none jointly controls task position, filler content, and context length for reasoning.
- The proposed Context Rot Evaluation (CRE) framework shows that model performance can drop sharply when the target task's position changes.
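To make the idea of jointly controlling task position, filler content, and context length concrete, here is a minimal sketch of how such an evaluation grid could be built. It is not the authors' CRE implementation: the function names `build_prompt` and `build_grid`, the use of filler "units" as a stand-in for context length, and the relative-depth parameter `position` are all assumptions made for illustration.

```python
from itertools import product


def build_prompt(task: str, filler_units: list[str], n_units: int, position: float) -> str:
    """Embed `task` among `n_units` filler units at a relative depth.

    `position` is a hypothetical knob: 0.0 puts the task first, 1.0 puts it last.
    Context length is approximated by the number of filler units, not tokens.
    """
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be in [0, 1]")
    if not filler_units:
        raise ValueError("filler_units must be non-empty")
    # Cycle through the filler pool so any requested length can be reached.
    filler = [filler_units[i % len(filler_units)] for i in range(n_units)]
    insert_at = round(position * n_units)
    return "\n\n".join(filler[:insert_at] + [task] + filler[insert_at:])


def build_grid(task: str, filler_pools: dict[str, list[str]],
               lengths: list[int], positions: list[float]) -> dict:
    """Return one prompt per (filler type, context length, position) combination."""
    return {
        (name, n, pos): build_prompt(task, filler_pools[name], n, pos)
        for name, n, pos in product(filler_pools, lengths, positions)
    }
```

Scoring a model on every cell of such a grid, rather than on a single fixed layout, is what would let a position-dependent drop be separated from the effects of filler type or context length.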
Opening excerpt (first ~120 words)
arXiv:2605.23170 (Computer Science > Computation and Language)
Submitted on 22 May 2026
Title: Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks
Authors: Chuyifei Zhang, Hongyu Cui, Xiaowen Huang, Jitao Sang
Abstract: Position-controlled evaluation is standard for retrieval tasks such as Needle-in-a-Haystack and RULER, but mainstream reasoning benchmarks do not control positional placement of target tasks in long contexts. We audit 11 long-context benchmarks and find none jointly controls task position, filler content, and context length for reasoning.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.