Streaming at scale: SSE and WebSockets for AI
The challenge
Your chat app's users start typing faster than your LLM replies.
- Requests pile up
- Some clients disconnect mid-generation
- Metrics show spikes in open connections and token waste
Discussion: You're streaming tokens, so why does latency still feel high?
1. Streaming is a system design problem, not a transport choice
Most engineers treat streaming as "pick SSE or WebSocket."
In reality, it's a system-wide coordination model.
flowchart LR
A[User Input] --> B[Gateway]
B --> C[Coordinator]
C --> D[Model Runner]
D --> E[Streamer]
E --> F[Client UI]
C --> G[Metrics & Backpressure]
Every component influences perceived streaming speed.
Streaming = how data flows through your architecture, not just the protocol.
2. What is the latency stack?
| Layer | Typical Delay (ms) | Design Lever |
|---|---|---|
| Model first token | 300-2000 | Prompt trimming, model choice |
| Token serialization | 10-50 per token | Buffered vs unbuffered I/O |
| Gateway routing | 20-100 | Keep-alive, chunk flush |
| Client render | 30-100 | Frame batching |
Challenge: How can you deliver the first visible token < 250ms, even if total generation = 5s?
(Hint: predictive buffering and async render.)
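Much of the client-side budget in that table is render time. Here is a minimal browser-side sketch of the "async render" idea: batch tokens per animation frame instead of touching the DOM on every event. The /chat endpoint and #output element are illustrative, and since EventSource only issues GET requests, the prompt rides in the query string here.

```ts
// Frame-batched rendering: buffer tokens as they arrive,
// write to the DOM at most once per animation frame.
const output = document.getElementById("output")!;
const source = new EventSource("/chat?prompt=hello"); // hypothetical endpoint

let pending = "";            // tokens received but not yet rendered
let frameScheduled = false;

source.addEventListener("message", (event) => {
  pending += event.data;
  if (!frameScheduled) {
    frameScheduled = true;
    requestAnimationFrame(() => {
      output.append(pending); // one DOM write per frame, not per token
      pending = "";
      frameScheduled = false;
    });
  }
});

source.addEventListener("done", () => source.close());
```

The first token still paints immediately; batching only smooths the steady-state stream so the renderer never becomes the bottleneck.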
3. What are the protocol trade-offs?
Server-Sent Events (SSE)
Unidirectional HTTP stream.
sequenceDiagram
participant Client
participant Server
Client->>Server: POST /chat
loop tokens
Server-->>Client: event: message\n data: token
end
Server-->>Client: event: done
Pros: Simple, cache-friendly
Cons: One-way, limited reconnection control
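As a concrete picture of that wire format, here is a minimal Node sketch of an SSE endpoint; generateTokens is a stand-in for the real model runner, and the route and payload shape are illustrative.

```ts
// Minimal SSE endpoint: one-way token stream over plain HTTP.
import http from "node:http";

// Stand-in for the model runner (hypothetical).
async function* generateTokens(prompt: string): AsyncGenerator<string> {
  for (const token of ["Hello", ",", " world", "!"]) {
    await new Promise((r) => setTimeout(r, 50)); // simulate decode time
    yield token;
  }
}

http.createServer(async (req, res) => {
  res.writeHead(200, {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    Connection: "keep-alive",
  });

  let id = 0;
  for await (const token of generateTokens("demo prompt")) {
    // The id field is what lets clients resume via Last-Event-ID later.
    res.write(`id: ${id++}\nevent: message\ndata: ${JSON.stringify(token)}\n\n`);
  }
  res.write("event: done\ndata: {}\n\n");
  res.end();
}).listen(8080);
```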
WebSockets
Bidirectional persistent connection.
sequenceDiagram
participant Client
participant Server
Client->>Server: connect(ws)
Client->>Server: prompt
loop tokens
Server-->>Client: token
end
Client->>Server: stop / feedback
Pros: Great for agents, interactive tools
Cons: Stateful, requires connection management + load balancing
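A browser-side sketch of the same flow over a WebSocket. The message shapes ({ type: "prompt" }, { type: "token" }, { type: "stop" }) are assumptions for illustration, not a standard protocol.

```ts
// Bidirectional streaming: prompt out, tokens in, and a "stop"
// control message back on the same connection.
const ws = new WebSocket("wss://example.com/chat"); // hypothetical endpoint

ws.addEventListener("open", () => {
  ws.send(JSON.stringify({ type: "prompt", text: "Explain backpressure" }));
});

ws.addEventListener("message", (event) => {
  const msg = JSON.parse(event.data);
  if (msg.type === "token") {
    document.getElementById("output")!.append(msg.text);
  } else if (msg.type === "done") {
    ws.close();
  }
});

// Wired to the "stop generation" button: same socket, opposite direction.
function stopGeneration() {
  ws.send(JSON.stringify({ type: "stop" }));
}
```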
Hybrid model (real systems)
graph TD
A[Frontend] -->|prompt| B[HTTP Gateway]
B -->|subscribe| C[Message Broker]
C --> D[Streamer Service]
D -->|SSE→UI| E[User]
A -->|control via WS| B
- Control plane (WebSocket): cancel, feedback
- Data plane (SSE): token flow
This hybrid is the pattern used in Anthropic- and OpenAI-style chat UIs such as Claude.
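On the client, the split looks roughly like the sketch below; the /stream and /control routes and the cancel message shape are illustrative.

```ts
// Hybrid split: SSE carries tokens (data plane),
// a WebSocket carries cancel/feedback (control plane).
const streamId = crypto.randomUUID(); // correlates the two channels

const data = new EventSource(`/stream/${streamId}`);      // hypothetical route
const control = new WebSocket("wss://example.com/control"); // hypothetical route

data.addEventListener("message", (e) => {
  document.getElementById("output")!.append(e.data);
});
data.addEventListener("done", () => data.close());

// Cancellation flows over the control plane and references the SSE stream.
function cancel() {
  control.send(JSON.stringify({ type: "cancel", streamId }));
  data.close();
}
```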
4. How do you design for backpressure and buffer control?
When the model produces faster than clients consume:
graph LR
M[Model Stream] -->|tokens| Q[Bounded Buffer]
Q -->|flush chunks| C[Client]
Q -->|pressure signal| M
Patterns:
- Dynamic chunk size (flush every N tokens or T ms)
- Drop policy for late clients
- Stream heartbeat (periodic event: ping messages)
Challenge: How would you prevent OOM if one user leaves a tab open and never reads from the stream?
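A server-side sketch of those patterns, assuming the model runner exposes an AsyncIterable of tokens and honors an AbortController; the constants and names are illustrative.

```ts
// Bounded buffering: flush every N tokens or T ms, send heartbeats,
// and give up on clients that stop reading.
import type { ServerResponse } from "node:http";

const FLUSH_TOKENS = 8;      // flush every N tokens…
const FLUSH_MS = 100;        // …or every T ms, whichever comes first
const MAX_BUFFERED = 4096;   // drop policy threshold (bytes)

async function streamTokens(
  res: ServerResponse,
  tokens: AsyncIterable<string>,
  abort: AbortController,    // lets us stop the model runner upstream
) {
  let chunk: string[] = [];
  let lastFlush = Date.now();

  // Heartbeat keeps intermediaries from closing an idle connection.
  const heartbeat = setInterval(() => res.write("event: ping\ndata: {}\n\n"), 15_000);

  for await (const token of tokens) {
    chunk.push(token);
    const due = chunk.length >= FLUSH_TOKENS || Date.now() - lastFlush >= FLUSH_MS;
    if (!due) continue;

    const ok = res.write(`data: ${JSON.stringify(chunk.join(""))}\n\n`);
    chunk = [];
    lastFlush = Date.now();

    // res.write() returning false is the backpressure signal:
    // this client's socket buffer is full.
    if (!ok) {
      if (res.writableLength > MAX_BUFFERED) {
        abort.abort();       // stop the model instead of buffering forever
        break;
      }
      await new Promise((r) => res.once("drain", r));
    }
  }

  if (chunk.length > 0) {
    res.write(`data: ${JSON.stringify(chunk.join(""))}\n\n`); // flush the tail
  }
  clearInterval(heartbeat);
  res.end("event: done\ndata: {}\n\n");
}
```

The abandoned-tab scenario from the challenge lands in the MAX_BUFFERED branch: rather than buffering indefinitely for a client that never reads, the stream aborts the model upstream.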
5. Resilience & reconnection strategy
Streaming is fragile: clients disconnect often.
Reconnection Checklist:
- Client sends Last-Event-ID
- Server resumes from token index or summary
- Partial state persisted (ephemeral store / Redis)
sequenceDiagram
participant C as Client
participant S as Server
C->>S: connect (Last-Event-ID=250)
S-->>C: resume token 251…
Think of it like "TCP for tokens."
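A minimal sketch of the resume path, assuming each stream's emitted tokens sit in an ephemeral store keyed by stream id (an in-memory Map stands in for Redis here). The browser's EventSource sends Last-Event-ID automatically when it reconnects.

```ts
// Resume a dropped stream from the last token the client acknowledged.
import http from "node:http";

// Per-stream token history, written by the live streaming path (omitted).
// In production this lives in Redis or another ephemeral store.
const history = new Map<string, string[]>();

http.createServer((req, res) => {
  const streamId =
    new URL(req.url!, "http://localhost").searchParams.get("id") ?? "demo";
  const lastId = Number(req.headers["last-event-id"] ?? "-1");

  res.writeHead(200, {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
  });

  // Replay everything after the last id the client saw…
  const tokens = history.get(streamId) ?? [];
  for (let i = lastId + 1; i < tokens.length; i++) {
    res.write(`id: ${i}\ndata: ${JSON.stringify(tokens[i])}\n\n`);
  }
  // …then continue with live tokens from the model runner (omitted).
}).listen(8080);
```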
6. Observability in streaming systems
You need continuous metrics, not post-hoc logs.
flowchart TD
A[Streamer] --> B[Metrics]
B --> C[Dashboard]
B --> D[Alerting]
Metrics to collect:
- First token latency
- Tokens/sec throughput
- Stream duration distribution
- Error rate (4xx, 5xx, disconnects)
Visualize these in real time; latency histograms tell you where the UX pain hides.
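A small per-stream meter is usually enough to start; where the summary record goes (StatsD, OpenTelemetry, a log pipeline) depends on your stack.

```ts
// Per-stream metrics: first-token latency, duration, tokens/sec.
interface StreamMetrics {
  firstTokenMs: number | null;
  tokens: number;
  durationMs: number;
  tokensPerSec: number;
}

class StreamMeter {
  private start = performance.now();
  private firstTokenAt: number | null = null;
  private tokens = 0;

  // Call once per token inside the streaming loop.
  onToken() {
    if (this.firstTokenAt === null) this.firstTokenAt = performance.now();
    this.tokens++;
  }

  // Call when the stream closes (done, error, or disconnect).
  finish(): StreamMetrics {
    const durationMs = performance.now() - this.start;
    return {
      firstTokenMs:
        this.firstTokenAt === null ? null : this.firstTokenAt - this.start,
      tokens: this.tokens,
      durationMs,
      tokensPerSec: durationMs > 0 ? (this.tokens / durationMs) * 1000 : 0,
    };
  }
}
```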
7. Challenge example: AI pair coder streaming
Scenario: A live pair coder streams code suggestions as you type.
Goals:
- Show first token < 200ms
- Allow "stop generation" feedback
- Resume after disconnection
graph TD
A[IDE Plugin] -->|HTTP prompt| B[Gateway]
B --> C[Model Runner]
C --> D[Streamer]
D -->|SSE→IDE| A
A -->|WS feedback: stop| B
D --> E[Telemetry]
Design questions:
- What's your maximum open connections per instance?
- How do you cancel a token stream mid-flight without leaking GPU cycles?
- Can you back off the stream rate when client CPU usage spikes?
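For the cancellation question above, one sketch: the control channel's stop message aborts the gateway's in-flight request to the model runner. The endpoint and message shape are hypothetical.

```ts
// Cancelling mid-flight: a "stop" control message aborts the upstream
// request so no more tokens are generated for a stream nobody wants.
const inflight = new Map<string, AbortController>(); // streamId -> controller

// Called when the gateway starts a generation for a given stream.
async function startGeneration(streamId: string, prompt: string) {
  const controller = new AbortController();
  inflight.set(streamId, controller);
  try {
    // Hypothetical internal endpoint; the signal propagates the cancel.
    const resp = await fetch("http://model-runner.internal/generate", {
      method: "POST",
      body: JSON.stringify({ prompt }),
      signal: controller.signal,
    });
    // In a real gateway, pipe resp.body into the SSE streamer here (omitted).
  } finally {
    inflight.delete(streamId);
  }
}

// Called when the WebSocket control channel receives { type: "stop", streamId }.
function onStopMessage(streamId: string) {
  inflight.get(streamId)?.abort();
}
```

Aborting the HTTP request only saves GPU cycles if the runner actually stops decoding when the request disconnects, which is worth verifying for your serving stack.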
8. Optimizing perceived latency
Perceived latency ≠ actual latency.
Users judge responsiveness by first visible token and smoothness.
Design tricks:
- Emit "typing animation" placeholders before tokens arrive
- Send short prefix predictions fast, then stream full response
- Adapt chunk size to user network speed
That's why products from Anthropic and OpenAI feel fast even when the underlying model is slow.
9. Architectural checklist for streaming readiness
- Tokenized buffering and flush control
- Heartbeat and graceful close
- Reconnection support (Last-Event-ID)
- Per-stream metrics (first token, duration, tokens/sec)
- Hybrid data/control channels
- Cancellation and backpressure design
If any of these are missing → expect timeouts and frustrated users.
Discussion prompts for engineers
- Would you prioritize throughput or perceived latency for a consumer AI chat app?
- Where would you place buffer boundaries in a multi-model chain?
- How could you simulate network jitter and measure UX degradation?
- If you had to choose one metric to optimize, would it be first-token latency or completion latency? Why?
Takeaway
- Real-time AI systems aren't about protocols; they're about flow discipline.
- Streaming forces you to engineer for asynchrony, faults, and human perception.
You're no longer serving responses; you're orchestrating continuous conversations between humans and models.
For more on building production AI systems, check out our AI Bootcamp for Software Engineers.
For a hands-on path through this topic, see AI Agents Fundamentals.
For a deeper hands-on path, see AI Agent Design Patterns.
Key takeaways
- The pattern described above addresses a specific production failure mode that naive implementations miss.
- Mechanical guardrails beat heroic debugging. Ship the fix that prevents the bug class, not the bug instance.
- Measure before and after. If the change is not visible in metrics, it was not worth the complexity.
- To see this pattern wired into a full production agent stack, walk through the Build your own coding agent course, or start with the AI Agents Fundamentals primer.
Frequently asked questions
Should I use WebSocket or SSE for streaming LLM responses?
SSE and WebSocket aren't either/or choices for streaming LLM responses. SSE is simpler and cache-friendly for one-way token flow, but WebSocket gives you bidirectional control for cancellation and feedback. Production systems like Anthropic and OpenAI use both: WebSocket for control and SSE for data. Pick your primary constraint first: stateless simplicity or interactive responsiveness.
How do I reduce first token latency when streaming LLM responses?
First token latency is system-wide, not just model latency. The latency stack above identifies the bottlenecks: model time (300-2000ms), token serialization (10-50ms per token), gateway routing (20-100ms), and client render (30-100ms). Optimize in order: prompt trimming and model choice first, then unbuffered I/O for serialization, then gateway keep-alive and chunk flushing. Perceived latency matters more than absolute time, so emit typing placeholders before real tokens arrive.
What happens when a streaming connection drops mid-generation and how do I recover?
Token streams are fragile. Clients disconnect constantly. Recovery requires three pieces: client-side resumes using Last-Event-ID (the token index), server-side resumes from ephemeral state (Redis or memory), and a clear drop policy (discard or re-send buffered tokens?). Without this, dropped connections waste GPU cycles and create poor user experience. Heartbeat signals and graceful close handling prevent silent failures.
For the full reference, see the Server-Sent Events spec.
Ready to go deeper?
Go beyond articles. Build production AI systems with hands-on workshops and our intensive AI Bootcamp.