Streaming at scale: SSE and WebSockets for AI
The challenge
Your chat app's users start typing faster than your LLM replies.
- Requests pile up
- Some clients disconnect mid-generation
- Metrics show spikes in open connections and token waste
Discussion: You're streaming tokens, so why does latency still feel high?
1. Streaming is a system design problem, not a transport choice
Most engineers treat streaming as "pick SSE or WebSocket."
In reality, it's a system-wide coordination model.
flowchart LR
A[User Input] --> B[Gateway]
B --> C[Coordinator]
C --> D[Model Runner]
D --> E[Streamer]
E --> F[Client UI]
C --> G[Metrics & Backpressure]
Every component influences perceived streaming speed.
Streaming = how data flows through your architecture, not just the protocol.
2. What is the latency stack?
| Layer | Typical Delay (ms) | Design Lever |
|---|---|---|
| Model first token | 300-2000 | Prompt trimming, model choice |
| Token serialization | 10-50 per token | Buffered vs unbuffered I/O |
| Gateway routing | 20-100 | Keep-alive, chunk flush |
| Client render | 30-100 | Frame batching |
Challenge: How can you deliver the first visible token < 250ms, even if total generation = 5s?
(Hint: predictive buffering and async render.)
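Much of the client-side budget in that table is render time. Here is a minimal browser-side sketch of the "async render" idea: batch tokens per animation frame instead of touching the DOM on every event. The /chat endpoint and #output element are illustrative, and since EventSource only issues GET requests, the prompt rides in the query string here.

```ts
// Frame-batched rendering: buffer tokens as they arrive,
// write to the DOM at most once per animation frame.
const output = document.getElementById("output")!;
const source = new EventSource("/chat?prompt=hello"); // hypothetical endpoint

let pending = "";            // tokens received but not yet rendered
let frameScheduled = false;

source.addEventListener("message", (event) => {
  pending += event.data;
  if (!frameScheduled) {
    frameScheduled = true;
    requestAnimationFrame(() => {
      output.append(pending); // one DOM write per frame, not per token
      pending = "";
      frameScheduled = false;
    });
  }
});

source.addEventListener("done", () => source.close());
```

The first token still paints immediately; batching only smooths the steady-state stream so the renderer never becomes the bottleneck.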
3. What are the protocol trade-offs?
Server-Sent Events (SSE)
Unidirectional HTTP stream.
sequenceDiagram
participant Client
participant Server
Client->>Server: POST /chat
loop tokens
Server-->>Client: event: message\n data: token
end
Server-->>Client: event: done
Pros: Simple, cache-friendly
Cons: One-way, limited reconnection control
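As a concrete picture of that wire format, here is a minimal Node sketch of an SSE endpoint; generateTokens is a stand-in for the real model runner, and the route and payload shape are illustrative.

```ts
// Minimal SSE endpoint: one-way token stream over plain HTTP.
import http from "node:http";

// Stand-in for the model runner (hypothetical).
async function* generateTokens(prompt: string): AsyncGenerator<string> {
  for (const token of ["Hello", ",", " world", "!"]) {
    await new Promise((r) => setTimeout(r, 50)); // simulate decode time
    yield token;
  }
}

http.createServer(async (req, res) => {
  res.writeHead(200, {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    Connection: "keep-alive",
  });

  let id = 0;
  for await (const token of generateTokens("demo prompt")) {
    // The id field is what lets clients resume via Last-Event-ID later.
    res.write(`id: ${id++}\nevent: message\ndata: ${JSON.stringify(token)}\n\n`);
  }
  res.write("event: done\ndata: {}\n\n");
  res.end();
}).listen(8080);
```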
WebSockets
Bidirectional persistent connection.
sequenceDiagram
participant Client
participant Server
Client->>Server: connect(ws)
Client->>Server: prompt
loop tokens
Server-->>Client: token
end
Client->>Server: stop / feedback
Pros: Great for agents, interactive tools
Cons: Stateful, requires connection management + load balancing
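A browser-side sketch of the same flow over a WebSocket. The message shapes ({ type: "prompt" }, { type: "token" }, { type: "stop" }) are assumptions for illustration, not a standard protocol.

```ts
// Bidirectional streaming: prompt out, tokens in, and a "stop"
// control message back on the same connection.
const ws = new WebSocket("wss://example.com/chat"); // hypothetical endpoint

ws.addEventListener("open", () => {
  ws.send(JSON.stringify({ type: "prompt", text: "Explain backpressure" }));
});

ws.addEventListener("message", (event) => {
  const msg = JSON.parse(event.data);
  if (msg.type === "token") {
    document.getElementById("output")!.append(msg.text);
  } else if (msg.type === "done") {
    ws.close();
  }
});

// Wired to the "stop generation" button: same socket, opposite direction.
function stopGeneration() {
  ws.send(JSON.stringify({ type: "stop" }));
}
```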
Hybrid model (real systems)
graph TD
A[Frontend] -->|prompt| B[HTTP Gateway]
B -->|subscribe| C[Message Broker]
C --> D[Streamer Service]
D -->|SSE→UI| E[User]
A -->|control via WS| B
- Control plane (WebSocket): cancel, feedback
- Data plane (SSE): token flow
This hybrid is the pattern used in Anthropic- and OpenAI-style chat UIs such as Claude.
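On the client, the split looks roughly like the sketch below; the /stream and /control routes and the cancel message shape are illustrative.

```ts
// Hybrid split: SSE carries tokens (data plane),
// a WebSocket carries cancel/feedback (control plane).
const streamId = crypto.randomUUID(); // correlates the two channels

const data = new EventSource(`/stream/${streamId}`);      // hypothetical route
const control = new WebSocket("wss://example.com/control"); // hypothetical route

data.addEventListener("message", (e) => {
  document.getElementById("output")!.append(e.data);
});
data.addEventListener("done", () => data.close());

// Cancellation flows over the control plane and references the SSE stream.
function cancel() {
  control.send(JSON.stringify({ type: "cancel", streamId }));
  data.close();
}
```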
4. How do you design for backpressure and buffer control?
When the model produces faster than clients consume:
graph LR
M[Model Stream] -->|tokens| Q[Bounded Buffer]
Q -->|flush chunks| C[Client]
Q -->|pressure signal| M
Patterns:
- Dynamic chunk size (flush every N tokens or T ms)
- Drop policy for late clients
- Stream heartbeat (periodic event: ping messages)
Challenge: How would you prevent OOM if one user leaves a tab open and never reads from the stream?
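A server-side sketch of those patterns, assuming the model runner exposes an AsyncIterable of tokens and honors an AbortController; the constants and names are illustrative.

```ts
// Bounded buffering: flush every N tokens or T ms, send heartbeats,
// and give up on clients that stop reading.
import type { ServerResponse } from "node:http";

const FLUSH_TOKENS = 8;      // flush every N tokens…
const FLUSH_MS = 100;        // …or every T ms, whichever comes first
const MAX_BUFFERED = 4096;   // drop policy threshold (bytes)

async function streamTokens(
  res: ServerResponse,
  tokens: AsyncIterable<string>,
  abort: AbortController,    // lets us stop the model runner upstream
) {
  let chunk: string[] = [];
  let lastFlush = Date.now();

  // Heartbeat keeps intermediaries from closing an idle connection.
  const heartbeat = setInterval(() => res.write("event: ping\ndata: {}\n\n"), 15_000);

  for await (const token of tokens) {
    chunk.push(token);
    const due = chunk.length >= FLUSH_TOKENS || Date.now() - lastFlush >= FLUSH_MS;
    if (!due) continue;

    const ok = res.write(`data: ${JSON.stringify(chunk.join(""))}\n\n`);
    chunk = [];
    lastFlush = Date.now();

    // res.write() returning false is the backpressure signal:
    // this client's socket buffer is full.
    if (!ok) {
      if (res.writableLength > MAX_BUFFERED) {
        abort.abort();       // stop the model instead of buffering forever
        break;
      }
      await new Promise((r) => res.once("drain", r));
    }
  }

  if (chunk.length > 0) {
    res.write(`data: ${JSON.stringify(chunk.join(""))}\n\n`); // flush the tail
  }
  clearInterval(heartbeat);
  res.end("event: done\ndata: {}\n\n");
}
```

The abandoned-tab scenario from the challenge lands in the MAX_BUFFERED branch: rather than buffering indefinitely for a client that never reads, the stream aborts the model upstream.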
5. Resilience & reconnection strategy
Streaming is fragile: clients disconnect often.
Reconnection Checklist:
- Client sends Last-Event-ID
- Server resumes from token index or summary
- Partial state persisted (ephemeral store / Redis)
sequenceDiagram
participant C as Client
participant S as Server
C->>S: connect (Last-Event-ID=250)
S-->>C: resume token 251…
Think of it like "TCP for tokens."
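A minimal sketch of the resume path, assuming each stream's emitted tokens sit in an ephemeral store keyed by stream id (an in-memory Map stands in for Redis here). The browser's EventSource sends Last-Event-ID automatically when it reconnects.

```ts
// Resume a dropped stream from the last token the client acknowledged.
import http from "node:http";

// Per-stream token history, written by the live streaming path (omitted).
// In production this lives in Redis or another ephemeral store.
const history = new Map<string, string[]>();

http.createServer((req, res) => {
  const streamId =
    new URL(req.url!, "http://localhost").searchParams.get("id") ?? "demo";
  const lastId = Number(req.headers["last-event-id"] ?? "-1");

  res.writeHead(200, {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
  });

  // Replay everything after the last id the client saw…
  const tokens = history.get(streamId) ?? [];
  for (let i = lastId + 1; i < tokens.length; i++) {
    res.write(`id: ${i}\ndata: ${JSON.stringify(tokens[i])}\n\n`);
  }
  // …then continue with live tokens from the model runner (omitted).
}).listen(8080);
```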
6. Observability in streaming systems
You need continuous metrics, not post-hoc logs.
flowchart TD
A[Streamer] --> B[Metrics]
B --> C[Dashboard]
B --> D[Alerting]
Metrics to collect:
- First token latency
- Tokens/sec throughput
- Stream duration distribution
- Error rate (4xx, 5xx, disconnects)
Visualize these in real time; latency histograms tell you where the UX pain hides.
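A small per-stream meter is usually enough to start; where the summary record goes (StatsD, OpenTelemetry, a log pipeline) depends on your stack.

```ts
// Per-stream metrics: first-token latency, duration, tokens/sec.
interface StreamMetrics {
  firstTokenMs: number | null;
  tokens: number;
  durationMs: number;
  tokensPerSec: number;
}

class StreamMeter {
  private start = performance.now();
  private firstTokenAt: number | null = null;
  private tokens = 0;

  // Call once per token inside the streaming loop.
  onToken() {
    if (this.firstTokenAt === null) this.firstTokenAt = performance.now();
    this.tokens++;
  }

  // Call when the stream closes (done, error, or disconnect).
  finish(): StreamMetrics {
    const durationMs = performance.now() - this.start;
    return {
      firstTokenMs:
        this.firstTokenAt === null ? null : this.firstTokenAt - this.start,
      tokens: this.tokens,
      durationMs,
      tokensPerSec: durationMs > 0 ? (this.tokens / durationMs) * 1000 : 0,
    };
  }
}
```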
7. Challenge example: AI pair coder streaming
Scenario: A live pair coder streams code suggestions as you type.
Goals:
- Show first token < 200ms
- Allow "stop generation" feedback
- Resume after disconnection
graph TD
A[IDE Plugin] -->|HTTP prompt| B[Gateway]
B --> C[Model Runner]
C --> D[Streamer]
D -->|SSE→IDE| A
A -->|WS feedback: stop| B
D --> E[Telemetry]
Design questions:
- What's your maximum open connections per instance?
- How do you cancel a token stream mid-flight without leaking GPU cycles?
- Can you back off the stream rate when client CPU usage spikes?
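For the cancellation question above, one sketch: the control channel's stop message aborts the gateway's in-flight request to the model runner. The endpoint and message shape are hypothetical.

```ts
// Cancelling mid-flight: a "stop" control message aborts the upstream
// request so no more tokens are generated for a stream nobody wants.
const inflight = new Map<string, AbortController>(); // streamId -> controller

// Called when the gateway starts a generation for a given stream.
async function startGeneration(streamId: string, prompt: string) {
  const controller = new AbortController();
  inflight.set(streamId, controller);
  try {
    // Hypothetical internal endpoint; the signal propagates the cancel.
    const resp = await fetch("http://model-runner.internal/generate", {
      method: "POST",
      body: JSON.stringify({ prompt }),
      signal: controller.signal,
    });
    // In a real gateway, pipe resp.body into the SSE streamer here (omitted).
  } finally {
    inflight.delete(streamId);
  }
}

// Called when the WebSocket control channel receives { type: "stop", streamId }.
function onStopMessage(streamId: string) {
  inflight.get(streamId)?.abort();
}
```

Aborting the HTTP request only saves GPU cycles if the runner actually stops decoding when the request disconnects, which is worth verifying for your serving stack.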
8. Optimizing perceived latency
Perceived latency ≠ actual latency.
Users judge responsiveness by first visible token and smoothness.
Design tricks:
- Emit "typing animation" placeholders before tokens arrive
- Send short prefix predictions fast, then stream full response
- Adapt chunk size to user network speed
That's why products from Anthropic and OpenAI feel fast even when the underlying model is slow.
9. Architectural checklist for streaming readiness
- Tokenized buffering and flush control
- Heartbeat and graceful close
- Reconnection support (Last-Event-ID)
- Per-stream metrics (first token, duration, tokens/sec)
- Hybrid data/control channels
- Cancellation and backpressure design
If any of these are missing → expect timeouts and frustrated users.
Discussion prompts for engineers
- Would you prioritize throughput or perceived latency for a consumer AI chat app?
- Where would you place buffer boundaries in a multi-model chain?
- How could you simulate network jitter and measure UX degradation?
- If you had to choose one metric to optimize, would it be first-token latency or completion latency? Why?
Takeaway
- Real-time AI systems aren't about protocols; they're about flow discipline.
- Streaming forces you to engineer for asynchrony, faults, and human perception.
You're no longer serving responses; you're orchestrating continuous conversations between humans and models.
For more on building production AI systems, check out our AI Bootcamp for Software Engineers.
For a hands-on path through this topic, see AI Agents Fundamentals.
For a deeper hands-on path, see AI Agent Design Patterns.
Key takeaways
- The pattern described above addresses a specific production failure mode that naive implementations miss.
- Mechanical guardrails beat heroic debugging. Ship the fix that prevents the bug class, not the bug instance.
- Measure before and after. If the change is not visible in metrics, it was not worth the complexity.
- To see this pattern wired into a full production agent stack, walk through the Build your own coding agent course, or start with the AI Agents Fundamentals primer.
Frequently asked questions
Should I use WebSocket or SSE for streaming LLM responses?
SSE and WebSocket aren't either/or choices for streaming LLM responses. SSE is simpler and cache-friendly for one-way token flow, but WebSocket gives you bidirectional control for cancellation and feedback. Production systems like Anthropic and OpenAI use both: WebSocket for control and SSE for data. Pick your primary constraint first: stateless simplicity or interactive responsiveness.
How do I reduce first token latency when streaming LLM responses?
First token latency is system-wide, not just model latency. The latency stack above identifies the bottlenecks: model time (300-2000ms), token serialization (10-50ms per token), gateway routing (20-100ms), and client render (30-100ms). Optimize in order: prompt trimming and model choice first, then unbuffered I/O for serialization, then gateway keep-alive and chunk flushing. Perceived latency matters more than absolute time, so emit typing placeholders before real tokens arrive.
What happens when a streaming connection drops mid-generation and how do I recover?
Token streams are fragile. Clients disconnect constantly. Recovery requires three pieces: client-side resumes using Last-Event-ID (the token index), server-side resumes from ephemeral state (Redis or memory), and a clear drop policy (discard or re-send buffered tokens?). Without this, dropped connections waste GPU cycles and create poor user experience. Heartbeat signals and graceful close handling prevent silent failures.
For the full reference, see the Server-Sent Events spec.
Ready to go deeper?
Go beyond articles. Build production AI systems with hands-on workshops and our intensive AI Bootcamp.