In a text chatbot, "memory" is easy. If the user scrolls up, they see the history. If the bot forgets something, the user can just re-read the previous messages.

In Voice AI, the rules change completely.

  1. Context is Invisible: The user cannot "scroll up." If the bot forgets that my name is Bob, the illusion of intelligence shatters immediately.
  2. Context is Latency: Every token you send to the LLM adds milliseconds to the response time. Sending a 10-minute transcript (approx. 1,500 words) to GPT-4o doesn't just cost money; it adds a 1-2 second processing delay.

In voice, Latency is the enemy.

This post explores how to manage conversation memory so your bot stays smart enough to remember you, but light enough to respond instantly.

What is the context bloat curve?

Imagine a 10-minute customer support call.

  • Minute 1: History is short. Latency is 300ms. Snappy.
  • Minute 5: History is 2,000 tokens. Latency creeps to 800ms.
  • Minute 10: History is 4,000 tokens. Latency spikes to 1.5s. The user starts interrupting the bot because it feels slow.

We cannot simply append every User and Assistant message to the list forever. We need a strategy to prune the history while keeping the meaning.
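The curve above can be made concrete with back-of-envelope arithmetic. This is a sketch, not a benchmark: the tokens-per-minute rate and per-token prefill cost are assumptions chosen to roughly match the numbers above; measure your own stack.

```python
# Back-of-envelope model of the context bloat curve.
# Assumptions (not measured): ~400 tokens of history accumulate per
# minute of two-sided conversation, and each history token adds
# ~0.3 ms of prefill time on top of a 300 ms baseline.
TOKENS_PER_MINUTE = 400
MS_PER_TOKEN = 0.3
BASELINE_MS = 300

def estimated_latency_ms(minutes: float) -> float:
    history_tokens = minutes * TOKENS_PER_MINUTE
    return BASELINE_MS + history_tokens * MS_PER_TOKEN

for m in (1, 5, 10):
    print(f"minute {m:>2}: ~{estimated_latency_ms(m):.0f} ms")
# minute 10 lands at ~1500 ms, matching the curve above
```

The exact constants don't matter; the point is that latency grows linearly with history, so history must be bounded.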

What is the sliding window approach?

For a fluid voice conversation, the bot usually only needs the last 3-4 turns to understand immediate context (e.g., "Yes, that works" or "No, the other one").

We implement a Sliding Window manager that keeps the System Prompt fixed (the "Personality") but strictly trims the middle of the conversation.

graph LR
    subgraph RAW["Raw Conversation History"]
        A[System] --> B[Turn 1]
        B --> C[Turn 2]
        C --> D[Turn 3]
        D --> E[Turn 4]
        E --> F[Turn 5]
    end
    
    subgraph WINDOW["Sliding Window: Context sent to LLM"]
        A2[System] --> D2[Turn 3]
        D2 --> E2[Turn 4]
        E2 --> F2[Turn 5]
    end
    
    style B fill:#ffebee,stroke:#b71c1c
    style C fill:#ffebee,stroke:#b71c1c

The Implementation:

In LiveKit agents, the context is often managed automatically, but for production, you want explicit control.

# A simple manual pruner
def prune_context(chat_ctx):
    # Always keep the System Prompt (index 0)
    system_prompt = chat_ctx.messages[0]
    
    # Get the rest of the history
    history = chat_ctx.messages[1:]
    
    # Keep only the last 6 messages (3 turns)
    if len(history) > 6:
        history = history[-6:]
        
    return [system_prompt] + history

Pros: Zero latency overhead. Extremely cheap.

Cons: The "Goldfish Effect." If I said "My name is Bob" at minute 1, and the window slides past it, the bot forgets my name at minute 3.

What is the sidecar summarizer?

To solve the Goldfish Effect without bloating the main context, we use a Background Process.

While the main agent is chatting, a second, smaller LLM (the "Sidecar") runs in the background. It watches the conversation and updates a "Summary" section in the System Prompt.

graph TD
    A[Voice Conversation Stream] --> B(Main Agent Loop)
    A --> C(Background Sidecar Worker)
    
    C --> D[Extract Facts: User is Bob, Wants Pizza]
    D --> E[Update System Prompt]
    
    E --> B
    
    style C fill:#fff9c4,stroke:#fbc02d
    style E fill:#e3f2fd,stroke:#0d47a1

The Implementation:

We use an async task so we don't block the audio stream.

async def background_summarizer(full_history, agent):
    """
    Runs periodically to compress history into facts.
    """
    # We use a cheap, fast model (like gpt-4o-mini) for summarization.
    # `cheap_llm` and `agent.update_system_prompt` are placeholders for
    # your own LLM client and agent API.
    summary = await cheap_llm.generate(
        f"Extract key facts from this conversation history: {full_history}"
    )
    
    # We inject these facts into the 'hidden' context of the main agent
    new_system_prompt = f"""
    You are a helpful assistant.
    
    CORE MEMORY (DO NOT FORGET):
    {summary}
    """
    
    # Update the running agent's prompt live
    agent.update_system_prompt(new_system_prompt)
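Here is one way the sidecar could be scheduled, as a runnable sketch. Assumptions: `summarize` is stubbed out (in production it would call the cheap model), and `update_prompt` is any callable that swaps the agent's system prompt; neither is LiveKit's actual API.

```python
import asyncio

async def summarize(history: list[str]) -> str:
    # Stub: in production this would be a call to a cheap, fast model.
    return "; ".join(history[-3:])

async def run_sidecar(history: list[str], update_prompt, interval_s: float = 10.0) -> None:
    """Periodically compress the history and push it into the system prompt."""
    while True:
        await asyncio.sleep(interval_s)
        if history:
            summary = await summarize(history)
            update_prompt(f"CORE MEMORY (DO NOT FORGET):\n{summary}")

async def demo() -> list[str]:
    history = ["User: my name is Bob", "Assistant: Hi Bob!"]
    prompts: list[str] = []
    # The sidecar runs as its own task, so the audio loop is never blocked.
    task = asyncio.create_task(run_sidecar(history, prompts.append, interval_s=0.01))
    await asyncio.sleep(0.1)  # the main agent keeps serving audio meanwhile
    task.cancel()
    return prompts

captured = asyncio.run(demo())
print(captured[-1])
```

Because the sidecar only ever writes to the system prompt between turns, the main loop never waits on it; the cost is the staleness window described below.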

Pros: Retains long-term context (names, preferences) without growing token count.

Cons: There is a delay. The summary might update 10 seconds after the user says the fact.

What is structured state extraction (the "pro" move)?

Summaries are fuzzy. "User wants pizza" is text.

For reliable applications (like ordering food), we don't want text summaries; we want Structured Data.

Instead of summarizing, we give the agent a tool called update_order or save_profile. The agent "offloads" memory to a structured object.

  • User: "I want a pepperoni pizza."
  • Agent (Thought): User provided data. I will call update_order(item="pepperoni pizza").
  • System: Updates order_state = {"items": ["pepperoni pizza"]}.
  • System: Injects Current Order: 1x Pepperoni Pizza into the System Prompt.

This keeps the prompt tiny but the memory perfect.
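The flow above can be sketched with plain dataclasses (the challenge below mentions Pydantic, which works the same way). The tool name `update_order` comes from the text; everything else here is an assumed shape, not a specific framework's API.

```python
from dataclasses import dataclass, field

@dataclass
class OrderState:
    items: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Compact, always-current memory line injected into the system prompt."""
        if not self.items:
            return "Current Order: (empty)"
        return "Current Order: " + ", ".join(f"1x {i.title()}" for i in self.items)

order_state = OrderState()

def update_order(item: str) -> str:
    # The tool the LLM calls instead of relying on transcript memory.
    order_state.items.append(item)
    return f"Added {item} to the order."

def build_system_prompt() -> str:
    # The prompt stays tiny: personality plus one structured-state line.
    return f"You are a pizza-ordering assistant.\n\n{order_state.render()}"

update_order("pepperoni pizza")
print(build_system_prompt())
# → includes "Current Order: 1x Pepperoni Pizza"
```

Because the state object is rebuilt into the prompt on every turn, recall is exact no matter how long the call runs, and the token cost stays constant.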

Engineering trade-off matrix

| Strategy | Latency Impact | Recall Quality | Token Cost | Best For |
|---|---|---|---|---|
| Full History | High (Bad) | Perfect | High | Short demos (< 2 mins) |
| Sliding Window | Low (Good) | Low (Forgets) | Low | Casual chat / Small talk |
| Async Summary | Low (Good) | Medium (Fuzzy) | Medium | Support bots / General Q&A |
| Structured State | Low (Good) | High (Precise) | Low | Transactional Bots (Ordering, Booking) |

Challenge for you

Scenario: You are building a Medical Intake Voice Bot.

  • Requirement: The call might last 20 minutes. You must capture every symptom mentioned, even if it was said at minute 1. You cannot lose data.
  • Constraint: You cannot simply keep 20 minutes of text in the prompt (latency will be too high).

Your Task:

  1. Why would Sliding Window fail here?
  2. Why might Async Summarization be risky (think about "hallucinating" a symptom)?
  3. Design a Structured State solution. What would your Pydantic schema look like for PatientData? How would you prompt the agent to save symptoms as they are spoken?

Frequently asked questions

How much does conversation history impact voice AI latency?

Every token in conversation history adds milliseconds to processing time. A 10-minute call (4,000 tokens) adds 1+ second of latency, making the bot feel slow and unresponsive. This is why voice bots can't simply keep full history like text bots. The post explores three strategies to manage this trade-off: sliding windows, async summarization, and structured state extraction, each with different latency and recall profiles.

Should I use structured state extraction or summarization for voice memory?

Structured state extraction wins for any bot that needs reliable recall. Summarization works for casual support, but summaries lose detail and feel imprecise. The post's trade-off matrix is clear: structured state is the only strategy that delivers both low latency and perfect recall for long conversations. Give the agent a tool to write JSON, keep the prompt tiny, and never lose data again.

Why does my voice bot forget early information in long calls?

The Goldfish Effect happens with sliding windows: you keep only the last 3-4 turns to stay fast, so context from minute 1 disappears by minute 3. To fix this without bloating latency, use async summarization (a background LLM updates a summary in the system prompt) or structured state extraction (tools offload facts to JSON). The post explains when each trade-off makes sense.

For the full reference, see the LiveKit documentation.

Key takeaways

  • Context is latency in voice: Every token adds milliseconds to response time, making memory management critical for sub-500ms latency
  • Sliding windows trade recall for speed: Keeping only recent turns enables fast responses but causes the "Goldfish Effect" where early context is lost
  • Async summarization preserves facts: Background processes can compress long conversations into key facts without blocking the main audio stream
  • Structured state is the most reliable: Using tools to extract and store data in structured formats (Pydantic models) provides precise memory without token bloat
  • Different strategies for different use cases: Casual chat needs speed (sliding window), support needs facts (summarization), transactional needs precision (structured state)
  • Memory tools enable precise extraction: Giving agents tools like update_order or save_profile lets them offload memory to structured objects
  • System prompts can hold compressed context: Injecting structured state summaries into system prompts keeps context small but accurate

For more on voice AI systems, see our voice AI fundamentals guide and our streaming guide.


For more on building production AI systems, check out our AI Bootcamp for Software Engineers.
