Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks

May 25, 2026 · 4:00 AM UTC · 3 min read

#artificial intelligence #machine learning #computation

TL;DR · WeSearch summary

The paper examines positional failures in long-context large language models (LLMs) and the gap they expose in reasoning benchmarks. It argues that current reasoning benchmarks do not jointly control task position, filler content, and context length, so position-dependent failures go unmeasured. The authors propose an evaluation framework, Context Rot Evaluation (CRE), and use it to show that some models drop sharply in performance when the target task's position changes.

Key facts

▪ Mainstream reasoning benchmarks do not control the positional placement of target tasks in long contexts.
▪ An audit of 11 long-context benchmarks finds that none jointly controls task position, filler content, and context length for reasoning.
▪ The proposed Context Rot Evaluation (CRE) framework shows that models can drop sharply in performance when the target task's position changes.
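
To make the three controlled variables concrete, here is a minimal Python sketch of a position-controlled prompt builder. It is not the paper's CRE implementation, which the excerpt does not describe. The function names `build_context` and `position_sweep`, the use of whole filler passages as the unit of context length, and the relative-position parameter are all assumptions made for illustration.

```python
from typing import Iterator


def build_context(task: str, filler: list[str], position: float, n_filler: int) -> str:
    """Place `task` among the first `n_filler` filler passages.

    `position` is relative: 0.0 puts the task first, 1.0 puts it last.
    All names and conventions here are hypothetical, not taken from the paper.
    """
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be in [0, 1]")
    if n_filler > len(filler):
        raise ValueError("not enough filler passages")
    chunks = filler[:n_filler]             # context length: number of passages
    insert_at = round(position * n_filler)  # task position: index among passages
    return "\n\n".join(chunks[:insert_at] + [task] + chunks[insert_at:])


def position_sweep(
    task: str, filler: list[str], positions: list[float], lengths: list[int]
) -> Iterator[tuple[float, int, str]]:
    """Yield (position, n_filler, prompt) for every cell of the grid."""
    for n in lengths:
        for p in positions:
            yield p, n, build_context(task, filler, p, n)
```

Holding the task and the filler pool fixed while sweeping `positions` and `lengths` means any accuracy difference between cells can be attributed to placement or length rather than to a change in the task itself.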

Original article

arXiv cs.AI


Opening excerpt (first ~120 words)

Computer Science > Computation and Language

arXiv:2605.23170 (cs) [Submitted on 22 May 2026]

Title: Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks

Authors: Chuyifei Zhang, Hongyu Cui, Xiaowen Huang, Jitao Sang

Abstract: Position-controlled evaluation is standard for retrieval tasks such as Needle-in-a-Haystack and RULER, but mainstream reasoning benchmarks do not control positional placement of target tasks in long contexts. We audit 11 long-context benchmarks and find none jointly controls task position, filler content, and context length for reasoning.

…

Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.
