Context Rot: How Increasing Input Tokens Impacts LLM Performance
Paper


2025.07.27
· Web · by Anonymous
#LLM #Context Window #Performance Evaluation

Key Points

  • This report challenges the common assumption that Large Language Models process context uniformly, observing that performance varies significantly with input length.
  • The study reveals that LLM performance grows increasingly unreliable as the input token count increases.
  • These findings are based on an evaluation of 18 LLMs, including state-of-the-art models like GPT-4.1, Claude 4, Gemini 2.5, and Qwen3.

The paper, "Context Rot: How Increasing Input Tokens Impacts LLM Performance," by Kelly Hong, Anton Troynikov, and Jeff Huber of Chroma, addresses a fundamental assumption about Large Language Models (LLMs): that they process contextual information uniformly across the entire input sequence, so that a token at the 10,000th position would be handled as reliably as one at the 100th. The authors challenge this assumption, showing that model performance varies significantly with input length, even on straightforward tasks.

The core methodology is an empirical evaluation of 18 distinct LLMs, including state-of-the-art models such as GPT-4.1, Claude 4, Gemini 2.5, and Qwen3. The research systematically assesses how each model's performance changes as the number of input tokens grows, aiming to quantify the extent of performance degradation or unreliability.

The principal finding of the study is that LLMs do not use their context uniformly. Instead, their performance grows "increasingly unreliable" as the input context lengthens. This degradation, termed "Context Rot," points to a systemic difficulty in maintaining consistent processing across extended input sequences. The phenomenon is illustrated on a "Repeated Words Task," where models including Claude Sonnet 4, GPT-4.1, Qwen3-32B, and Gemini 2.5 Flash exhibit this length-dependent performance decay.
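A Repeated Words-style input can be sketched as follows, assuming the task asks the model to reproduce a long run of one repeated word with a single different word inserted; the exact prompt wording, scoring, and word choices here are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch of a Repeated Words-style task input (assumed form):
# many copies of a common word with one unique word substituted, where the
# model is scored on exact reproduction of the sequence.

def make_repeated_words_input(common: str, unique: str,
                              n_words: int, unique_index: int) -> str:
    """Build `n_words` copies of `common` with `unique` at `unique_index`."""
    words = [common] * n_words
    words[unique_index] = unique
    return " ".join(words)

def score_exact_reproduction(expected: str, model_output: str) -> bool:
    """Score by exact match, ignoring surrounding whitespace."""
    return model_output.strip() == expected.strip()

# Example: 50 copies of "apple" with "apples" hidden at position 23.
task_input = make_repeated_words_input("apple", "apples", 50, 23)
```

Because the sequence length is a free parameter, a task like this lets input size grow while the underlying difficulty stays trivially simple, making any length-dependent failure easy to attribute to context handling rather than task complexity.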