Context Rot: Why LLMs Fail as Context Windows Grow
Claude Sonnet 4 just hit a 1M-token context window this week. Model makers love to brag about perfect needle-in-a-haystack scores, but how do these models really behave when you feed them mountains of text? That is what the researchers at Chroma set out to discover.
In Context Rot, Chroma researchers tested 18 leading LLMs (GPT-4.1, Claude 4, Gemini 2.5, Qwen 3, and more) using two text sources: Paul Graham’s essays and arXiv research papers.
For each, they hand-crafted “needles” (short, correct answers) and questions so the only way to answer was to find the exact needle in a massive “haystack” of text. This setup let them pinpoint where performance drops as context grows.
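To make the setup concrete, here’s a minimal sketch of that prompt construction (my reconstruction, not Chroma’s actual harness; the function and filler text are hypothetical):

```python
def build_haystack_prompt(needle: str, filler_chunks: list[str],
                          depth: float, question: str) -> str:
    """Plant the needle `depth` of the way through the filler (0.0 to 1.0)."""
    position = int(len(filler_chunks) * depth)
    chunks = filler_chunks[:position] + [needle] + filler_chunks[position:]
    haystack = "\n\n".join(chunks)
    return (f"{haystack}\n\n"
            "Answer using only the text above.\n"
            f"Question: {question}")

# Hypothetical example: essay paragraphs as filler, one hand-crafted needle.
filler = ["(an unrelated essay paragraph...)"] * 200
needle = "The best writing advice I ever received was to cut the first paragraph."
prompt = build_haystack_prompt(
    needle, filler, depth=0.5,
    question="What was the best writing advice the author received?",
)
```

Growing the haystack while holding the needle and question fixed is what lets them isolate context length as the variable.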
They ran a variety of experiments, so let’s jump in! (If you’re looking for a more detailed breakdown, check out Why Long Context Windows Still Don't Work.)
1. Needle-in-a-Haystack with Controlled Ambiguity
As context grows, performance drops faster when the question and the supporting snippet are only loosely related.
Even the best models degrade once more irrelevant text surrounds the answer.
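“Loosely related” here means low semantic overlap between the question and the needle. If you want to eyeball that distance yourself, here is a quick sketch, assuming an off-the-shelf sentence-transformers embedder (not whatever model Chroma used internally):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedding model

question = "What was the best writing advice the author received?"
close_needle = "The best writing advice I ever received was to cut the first paragraph."
loose_needle = "Trimming an opening usually improves an essay more than anything else."

q_vec, close_vec, loose_vec = model.encode([question, close_needle, loose_needle])
print("close pair:", float(cos_sim(q_vec, close_vec)))  # high similarity: easy case
print("loose pair:", float(cos_sim(q_vec, loose_vec)))  # lower similarity: degrades faster as context grows
```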
2. Distractors in Long Context
Even one plausible-looking distractor reduces accuracy. Four make it much worse.
Anthropic models abstain more when unsure. OpenAI models hallucinate confidently.
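To picture what a distractor is: a statement that is on-topic and plausible but answers the question wrongly. Here’s a hypothetical sketch of mixing distractors into the haystack (the helper and example sentences are mine, not the paper’s):

```python
import random

def add_distractors(filler_chunks: list[str], needle: str,
                    distractors: list[str], seed: int = 0) -> list[str]:
    """Scatter the needle and distractors at random positions in the filler."""
    rng = random.Random(seed)
    haystack = list(filler_chunks)
    for snippet in [needle] + distractors:
        haystack.insert(rng.randrange(len(haystack) + 1), snippet)
    return haystack

filler = ["(an unrelated essay paragraph...)"] * 200
needle = "The best writing advice I ever received was to cut the first paragraph."
distractors = [
    # Plausible, on-topic, wrong.
    "The best writing advice I ever received was to outline everything first.",
    "A friend once insisted the best advice is to never revise a first draft.",
]
haystack = add_distractors(filler, needle, distractors)
```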
3. Needle–Haystack Similarity
When the haystack is thematically similar to the needle, retrieval can get harder, but the trend isn’t consistent.
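One plausible way to quantify how much a needle “blends in”, assuming embedding similarity is a fair proxy (the metric below is my guess, not necessarily theirs):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedding model

def blend_in_score(needle: str, haystack_chunks: list[str]) -> float:
    """Mean cosine similarity between the needle and each haystack chunk."""
    vectors = model.encode([needle] + haystack_chunks)
    needle_vec, chunk_vecs = vectors[0], vectors[1:]
    return cos_sim(needle_vec, chunk_vecs).mean().item()  # higher = blends in more
```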

4. Haystack Structure
Surprisingly, models retrieve needles more reliably from shuffled (randomly ordered) haystacks than from logically coherent ones.
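The manipulation itself is trivial: same sentences, same needle, only the ordering changes. A sketch of the shuffled condition:

```python
import random

def shuffled_condition(sentences: list[str], seed: int = 0) -> list[str]:
    """Randomize sentence order; the content matches the coherent condition."""
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled

coherent = ["An essay opens with a claim.", "The next sentence builds on it.",
            "Each one follows logically from the last."]
shuffled = shuffled_condition(coherent)  # models find needles here more reliably
```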

5. Multi-Turn QA with Long Chat Histories
All models perform better with focused, relevant history (~300 tokens) than with the entire chat log (113k tokens).
The margin is largest for Anthropic models and smallest for Gemini, whose full-history performance stays impressively close to its focused-history performance.
“Thinking” modes help, but don’t close the gap.
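Here is a sketch of the two conditions being compared; the turn records are hypothetical, and the `relevant_ids` filter stands in for however the focused slice was actually selected:

```python
def build_history(turns: list[dict], relevant_ids: set[int] | None = None) -> list[dict]:
    """Full condition: pass every turn (the ~113k-token chat log).
    Focused condition: keep only turns relevant to the current question (~300 tokens)."""
    if relevant_ids is None:
        return turns
    return [turn for turn in turns if turn["id"] in relevant_ids]

turns = [
    {"id": 0, "role": "user", "content": "Help me plan a trip to Kyoto."},
    {"id": 1, "role": "assistant", "content": "Happy to! When are you going?"},
    # ...hundreds of unrelated turns...
    {"id": 412, "role": "user", "content": "Which city did I say I was visiting?"},
]
focused = build_history(turns, relevant_ids={0, 412})  # focused condition
full = build_history(turns)                            # full-log condition
```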
The Bottom Line
Context windows are growing fast, but accuracy isn’t keeping pace. Chroma’s results show that once you move beyond clean, lab-style needle-in-a-haystack retrieval, long context fails in all sorts of ways. Use it wisely, and keep retrieval, filtering, and prompt design in the loop.
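In practice, “keep retrieval in the loop” can be as simple as pulling only the top-k relevant chunks into the prompt rather than dumping in the whole corpus. A minimal sketch, again assuming a sentence-transformers embedder rather than any particular production stack:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import semantic_search

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedding model

def top_k_chunks(question: str, chunks: list[str], k: int = 5) -> list[str]:
    """Return only the k chunks most relevant to the question."""
    question_vec = model.encode([question])
    chunk_vecs = model.encode(chunks)
    hits = semantic_search(question_vec, chunk_vecs, top_k=k)[0]
    return [chunks[hit["corpus_id"]] for hit in hits]

# Prompt with these few chunks instead of the full 100k+ tokens.
```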