Your agent works great until it doesn't, around message 40

The first 30 messages of an agent conversation are smooth. Then around message 40 or 50, quality drops off a cliff. Responses get slower, more repetitive, and sometimes the model simply refuses to answer because the context is too long. You check the logs: the message history is 180K tokens. Your request cost is climbing by $0.50 per turn and the answers are getting worse, not better.

This is the context window problem, and it is the most underrated production issue in agent development. Every tutorial shows agents working on a clean, fresh context. Nobody shows what happens at message 100, which is exactly when real users start doing real work. The fix is not a bigger window; it is a management strategy.

This post covers the 4 strategies for bounding context in production agents, when to use each one, and the code that stitches them together so your agent still answers well at message 500 without costing you $100 per conversation.

Why do long sessions break agent quality?

Because LLMs exhibit "lost in the middle" behavior: they attend more to the start and end of the context and less to the middle. As the context grows, important information from the middle of the session effectively disappears from the model's attention, even though the tokens are technically present.

3 concrete failure modes you will see:

  1. Forgotten instructions. The user said "use pytest, not unittest" at message 3. By message 50, the model has added a unittest import because it can no longer see the early instruction.
  2. Redundant answers. The model re-explains things it already explained 20 messages ago because it lost track of what the user already knows.
  3. Repetition loops. The model gets stuck producing the same summary or suggestion because its recent context keeps reinforcing the same patterns.

None of these are hallucinations in the traditional sense. They are context management failures. The model is attending to the wrong subset of a window that is technically within its limit.

graph TD
    Session[Long session] --> Raw[Raw history: 180k tokens]
    Raw -->|sent as-is| Quality[Quality drops, cost climbs]

    Session --> Managed[Context manager]
    Managed -->|4 strategies| Clean[20k tokens, key info preserved]
    Clean --> Good[Quality stays, cost bounded]

    style Quality fill:#fee2e2,stroke:#b91c1c
    style Good fill:#dcfce7,stroke:#15803d

The context manager is the difference between "works at turn 10" and "still works at turn 500."

What are the 4 context management strategies?

Strategy 1: truncation

Drop the oldest messages once the total token count exceeds a threshold. Simple, predictable, and lossy: anything dropped is gone.

Use when: the session is a task-oriented flow where old context stops being relevant after a few turns. Example: "fix the bug, run the tests, fix the next bug." Once test 1 is green, the discussion around it is no longer needed.

# filename: truncate.py
# description: Drop the oldest messages until the history fits the budget.
# Keep the system prompt and the latest N turns.
def truncate_history(messages: list[dict], max_tokens: int, count_tokens) -> list[dict]:
    system = [m for m in messages if m['role'] == 'system']
    rest = [m for m in messages if m['role'] != 'system']

    total = sum(count_tokens(m) for m in system)
    kept = []
    for msg in reversed(rest):
        total += count_tokens(msg)
        if total > max_tokens:
            break
        kept.append(msg)
    return system + list(reversed(kept))

Keep the system prompt always. Drop from the front of the non-system history. The oldest messages go first. This is the cheapest strategy and it works for a surprising number of workloads.
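The count_tokens parameter above is any callable that estimates a message's token cost. A minimal stand-in, assuming roughly 4 characters per token (a common rule of thumb, not a real tokenizer; swap in one from your model provider for accurate budgets):

```python
# Rough token estimator: ~4 characters per token for English text.
# This heuristic is only a placeholder for the sketch; a real
# tokenizer gives exact counts and changes the budget math slightly.
def count_tokens(message: dict) -> int:
    content = str(message.get('content', ''))
    return max(1, len(content) // 4)
```

The estimate only needs to be consistent, not exact: truncation decisions depend on relative sizes against the budget, so a uniform underestimate or overestimate shifts the threshold but not the behavior.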

Strategy 2: summarization

Periodically summarize the oldest half of the history into a compact paragraph and replace those messages with the summary. Preserves the gist of early context at the cost of an extra LLM call.

Use when: early context matters but does not need to be preserved verbatim. Example: "we refactored auth and discussed the trade-offs" compressed into a 50-word summary is often enough.

# filename: summarize.py
# description: Summarize the oldest half of the conversation into a
# compact note and replace with a single system message.
from anthropic import Anthropic

client = Anthropic()


def summarize_oldest_half(messages: list[dict]) -> list[dict]:
    half = len(messages) // 2
    half -= half % 2  # cut on a turn boundary so user/assistant pairs stay intact
    if half == 0:
        return messages  # too short to be worth summarizing
    old, new = messages[:half], messages[half:]

    old_text = '\n'.join(f'{m["role"]}: {m["content"]}' for m in old)
    reply = client.messages.create(
        model='claude-haiku-4-5-20251001',
        max_tokens=400,
        messages=[{
            'role': 'user',
            'content': f'Summarize this conversation in 100 words preserving '
                       f'decisions, code references, and open questions:\n\n{old_text}',
        }],
    )
    summary = {'role': 'system', 'content': f'Summary of earlier conversation:\n{reply.content[0].text}'}
    return [summary, *new]

Run this when the history exceeds a threshold (say, 30K tokens). The summary is cheaper to re-send on every turn than the original 30K tokens, and the model keeps enough context to stay coherent.
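The threshold check can live in a small wrapper. In this sketch the summarizer is injected as a callable so the trigger logic stays testable without an API call; maybe_compact is an illustrative name, and the 30K default mirrors the threshold above:

```python
from typing import Callable


def maybe_compact(
    messages: list[dict],
    count_tokens: Callable[[dict], int],
    summarize: Callable[[list[dict]], list[dict]],
    threshold: int = 30_000,
) -> list[dict]:
    # Only pay for the summarization call once the history is big
    # enough that re-sending it every turn costs more than the summary.
    total = sum(count_tokens(m) for m in messages)
    if total <= threshold:
        return messages
    return summarize(messages)
```

Call it once per turn before building the request; below the threshold it is a no-op, so there is no extra cost until the history actually grows.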

Strategy 3: retrieval

Store old messages in a vector store and retrieve only the relevant ones for each new turn. Expensive to set up, cheapest at steady state. Only works if the session is long enough to amortize the indexing cost.

Use when: sessions regularly exceed 100 messages and users frequently reference things said earlier. Example: "what did I ask you about the login flow 20 messages back?"

The pattern: on every assistant response, embed it and index it. On every new user turn, retrieve the top 5 most relevant old messages and include them in the context alongside the last 10 verbatim.
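A minimal sketch of that pattern, using bag-of-words cosine similarity as a stand-in for real embeddings (in production you would embed with a model and query a vector store; every name here is illustrative):

```python
import math
from collections import Counter


def _vectorize(text: str) -> Counter:
    return Counter(text.lower().split())


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve_context(
    query: str,
    old_messages: list[dict],
    recent: list[dict],
    top_k: int = 5,
) -> list[dict]:
    # Score each archived message against the new user turn, keep the
    # top_k most similar, and append the recent verbatim window last
    # so the newest context stays closest to the model's attention.
    qv = _vectorize(query)
    scored = sorted(
        old_messages,
        key=lambda m: _cosine(qv, _vectorize(m['content'])),
        reverse=True,
    )
    return scored[:top_k] + recent
```

The structure survives the swap to real embeddings unchanged: only _vectorize and _cosine get replaced by an embedding call and a vector-store query.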

Strategy 4: sliding window with sticky notes

Keep the last N messages verbatim, plus a set of "sticky" facts extracted from the full session. The sticky notes are structured memory (user preferences, decisions, project facts) that survives across the sliding window.

Use when: you want the simplicity of truncation without losing user preferences and project-level facts. This is the pattern most production agents converge on.

# filename: sliding_sticky.py
# description: Sliding window over recent messages plus a stored set
# of extracted sticky facts that persist.
def build_context(
    system_prompt: str,
    sticky_facts: list[str],
    recent: list[dict],
) -> list[dict]:
    fact_block = 'Persistent facts:\n' + '\n'.join(f'- {f}' for f in sticky_facts)
    return [
        {'role': 'system', 'content': f'{system_prompt}\n\n{fact_block}'},
        *recent,
    ]

The sticky facts come from the extraction pattern covered in the Persistent Memory for Coding Agents: Cross-Session Context post. Apply it mid-session, not just at session end, to keep the sticky list fresh.
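build_context above assumes the caller already trimmed the recent window. One way to fold the slicing in, as a self-contained sketch (build_turn_context and keep_last are illustrative names, not a fixed API):

```python
def build_turn_context(
    system_prompt: str,
    sticky_facts: list[str],
    history: list[dict],
    keep_last: int = 10,
) -> list[dict]:
    # Recent messages stay verbatim; everything older survives only
    # through the sticky facts baked into the system message.
    facts = 'Persistent facts:\n' + '\n'.join(f'- {f}' for f in sticky_facts)
    return [
        {'role': 'system', 'content': f'{system_prompt}\n\n{facts}'},
        *history[-keep_last:],
    ]
```

Because the prompt is rebuilt from scratch every turn, a stale or wrong sticky fact is easy to correct: fix the fact once and every subsequent turn sees the fix.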

Which strategy should you pick?

| Workload | Truncate | Summarize | Retrieve | Sliding+Sticky |
|---|---|---|---|---|
| Short task sessions | Wins | Overkill | Overkill | Overkill |
| Long planning sessions | Fails | Wins | OK | Wins |
| Very long research | Fails | OK | Wins | OK |
| Cross-session agents | Fails | Fails | Partial | Wins (with persistent memory) |

The rule I use: start with sliding window plus sticky notes. It covers 80 percent of workloads with modest implementation cost. Add summarization when sessions regularly exceed 30K tokens. Add retrieval only when users frequently ask about things from the distant past.

For the persistent memory layer that provides the sticky facts across sessions, see the Persistent Memory for Coding Agents: Cross-Session Context post. For the full production agent stack that uses context management alongside observability and cost control, the Build Your Own Coding Agent course covers it module by module.

How do you measure context drift?

The symptom is that quality drops at a consistent message count. Measure by running an eval set at turn 5, turn 50, and turn 200 in a simulated conversation. The accuracy delta between turn 5 and turn 200 is your "context drift" number.

Good context management keeps the delta small (under 5 points). Poor management shows a cliff around turn 40 to 60 where accuracy drops 15 to 25 points. Measuring this number is the only way to know if your strategy actually works.

Do this measurement once per strategy change. Before adding summarization, measure. After adding summarization, measure again. If the delta did not shrink, the strategy is not helping.
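The measurement itself reduces to an accuracy delta across checkpoints. A sketch of that shape, assuming you already have per-turn eval accuracies from your own runner (context_drift is an illustrative name):

```python
def context_drift(accuracy_by_turn: dict[int, float]) -> float:
    # Drift = accuracy at the earliest checkpoint minus accuracy at
    # the latest one, in points. Positive drift means quality degraded
    # as the session grew.
    turns = sorted(accuracy_by_turn)
    return (accuracy_by_turn[turns[0]] - accuracy_by_turn[turns[-1]]) * 100


# Example: eval accuracy measured at turns 5, 50, and 200.
results = {5: 0.91, 50: 0.88, 200: 0.71}
drift = context_drift(results)  # ~20 points: a cliff, not noise
```

Log this number with each strategy change and the before/after comparison becomes a one-line diff instead of a gut feeling.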

When should you reset the session entirely?

When the user changes topic or workflow clearly. "Let's work on something else" is a signal. "New task: refactor auth" is a signal. At those boundaries, summarize the old session into persistent memory and start a new session with a clean context.

Don't reset silently. Tell the user "I'm starting a new session. Here is a summary of what we just covered." The transparency earns trust and the user can correct you if they expected continuity.

For the broader session model that supports this, see the User and Session Models for Multi-Tenant AI Agents post. The free AI Agents Fundamentals primer is the right starting point if the agent loop concept is still new.

What to do Monday morning

  1. Measure your current context drift. Run an eval at turn 5, 50, and 200 in a simulated session. If the delta is under 5 points, you probably do not need new strategies yet.
  2. If drift is high, add the sliding window plus sticky fact pattern first. It covers most workloads without much complexity.
  3. Add summarization when sessions regularly exceed 30K tokens. A Haiku summary call is cheap and cuts context cost noticeably.
  4. Add retrieval only if users complain that the agent "forgot" something from 100 messages back. This is the most expensive strategy and the least often necessary.
  5. Measure drift again after each change. Without measurement, you are guessing at whether the strategy worked.

The headline: context management is 4 strategies picked by workload, not a single magic solution. Start with sliding window plus sticky facts, add summarization when sessions get long, add retrieval only when you have to. Measure drift to confirm your strategy is working.

Frequently asked questions

What is context window management in AI agents?

It is the set of strategies used to keep the conversation history short enough to fit efficiently inside the model's context window without losing information the agent needs. Without management, long sessions either exceed the window entirely or degrade quality because the model attends less to the middle of a growing history. Management keeps quality high and cost bounded.

Why does agent quality drop in long sessions?

Because LLMs exhibit lost-in-the-middle behavior: they attend more to the start and end of the context and less to the middle. As the history grows, important instructions and decisions from the middle lose weight in the model's attention even though they are technically still in the window. Symptoms include forgotten instructions, repetition, and redundant answers.

What context window management strategies are worth using?

4 strategies cover almost every production workload: truncation (drop oldest messages), summarization (compress old history into a summary), retrieval (store old messages in a vector store and pull only relevant ones), and sliding window with sticky facts (keep recent messages verbatim plus extracted persistent facts). Most production agents converge on sliding window plus sticky facts.

How do I measure context drift in my agent?

Run your eval set at different session lengths: turn 5, turn 50, turn 200. The accuracy delta between early and late turns is your context drift number. Good context management keeps the delta under 5 points. Poor management shows a cliff around turn 40 to 60 with 15+ point drops. Measure once per strategy change.

When should an agent reset its session instead of managing context?

When the user clearly changes topic or workflow. "Let's work on something else" is a signal. "New task: refactor auth" is a signal. At those boundaries, summarize the old session into persistent memory and start a new session with a clean context. Tell the user what you are doing so they can correct you if they expected continuity.

Key takeaways

  1. Agent quality drops in long sessions because LLMs lose focus on the middle of growing contexts, not because they hit the token limit.
  2. 4 strategies manage context: truncation (cheapest, lossy), summarization (balanced), retrieval (expensive, thorough), sliding window with sticky facts (the production default).
  3. Start with sliding window plus sticky facts. It covers most workloads. Add summarization when sessions regularly exceed 30K tokens.
  4. Retrieval is for the specific case of long research sessions where users ask about things said 100 messages back. Otherwise skip it.
  5. Measure context drift with an eval set at multiple session lengths. Without measurement, you cannot tell if your strategy works.
  6. To see context management wired into a full production agent stack with memory, auth, and observability, walk through the Build Your Own Coding Agent course, or start with the AI Agents Fundamentals primer.

For the research on lost-in-the-middle attention in long contexts, see Liu et al., Lost in the Middle: How Language Models Use Long Contexts. The paper's graphs of attention vs position are a compelling case for active context management even when you are well under the token limit.
