Build a real-time phone agent in one FastAPI process. Stream audio over WebRTC with FastRTC, transcribe with Whisper, think with your LLM of choice, speak back with streaming TTS, and handle barge-in like a human would.
Message a mentor about fit, prerequisites, or where to start. Replies come on WhatsApp, usually within a day.
Build a real-time phone voice agent that streams low-latency audio over WebRTC using FastRTC, pipes it through an LLM, and speaks back with natural-sounding TTS. Single FastAPI process, swappable providers, barge-in aware.
Ship a real-time phone voice agent over WebRTC with FastRTC, Whisper, and swappable TTS.
What you'll ship
What you'll learn
Curriculum
First connection
Boot the FastAPI app, confirm the FastRTC signaling path, and keep the channel alive
Text turn baseline
Run the whole agent loop in text mode first, then do the latency math the audio loop has to hit
WebRTC mount
Stream real audio frames through FastRTC and understand the SDP and ICE dance that makes it possible
STT seam
Plug Whisper streaming into the transcribe seam, gate it with VAD, and use partial transcripts to start thinking early
TTS seam
Wire a real TTS provider into the synthesize seam and stream chunks so playback starts fast
Phone persona
Shape a system prompt that produces short spoken replies and add prosody hints that make the agent feel human
Tool calling
Let the agent look up flight status mid-call and cover the lookup latency with a holder phrase so the caller never hears dead air
Barge-in and turn detection
Let the caller interrupt mid-reply, cancel the agent cleanly, and reset state so the next turn is coherent
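The turn loop the curriculum builds toward can be sketched in a few lines. This is a hedged sketch, not the course's actual code: the STT and TTS providers are mocked so it runs standalone, and the pattern shown is the asyncio one where each spoken reply is a cancellable task that a caller barge-in interrupts (all names here are hypothetical).

```python
import asyncio

async def synthesize(text):
    # Mock streaming TTS: yield "audio chunks" one word at a time.
    for word in text.split():
        await asyncio.sleep(0.01)  # simulate per-chunk synthesis latency
        yield word

async def speak(text, spoken):
    # Stream chunks to playback; cancelling this task mid-stream
    # models the caller barging in.
    async for chunk in synthesize(text):
        spoken.append(chunk)

async def turn(reply, interrupt_after):
    spoken = []
    task = asyncio.create_task(speak(reply, spoken))
    await asyncio.sleep(interrupt_after)  # stand-in for VAD detecting speech
    if not task.done():
        task.cancel()  # barge-in: stop playback immediately
        try:
            await task
        except asyncio.CancelledError:
            pass  # state is now clean for the next turn
    return spoken

# First turn is interrupted early; second turn plays out in full.
interrupted = asyncio.run(turn("the flight departs at nine tonight", 0.025))
complete = asyncio.run(turn("anything else", 1.0))
print(interrupted)
print(complete)
```

The point of the sketch is the cancellation seam: because playback is a single task, barge-in is one `cancel()` plus a state reset, which is what keeps the next turn coherent.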
Who it's for
who built a streaming chatbot and now need to answer an actual phone call
who understand LLMs but have never shipped real-time audio over WebRTC
who want one Python process instead of stitching together Twilio, a media server, and a separate agent host
FAQ
Why FastRTC instead of LiveKit?
FastRTC keeps the entire voice loop inside a single FastAPI process. No separate media server, no extra signaling hop, no SIP trunk to configure. LiveKit is excellent when you need multi-participant rooms and server-side recording, and you will understand exactly when to reach for it after this course.
Do I need a phone number or SIP trunk?
No. The agent answers from a browser over WebRTC, which is how most production voice UX starts today. You can add a SIP bridge later, but the hard part is the audio loop and the latency budget, not the phone number.
Which models and providers does it use?
The default LLM is whatever OpenRouter routes you to, and you can swap in Gemini, Fireworks, or OpenAI with one environment variable. STT uses OpenAI Whisper when a key is present, with a mock fallback so the server boots with zero keys. TTS is a swappable seam so you can plug in ElevenLabs, Deepgram Aura, or Edge-TTS.
How is this different from the LiveKit course?
The LiveKit course shows you how to build agents on top of managed real-time infrastructure. This course goes one layer down and shows you how the real-time loop actually works, inside one process, so you can debug latency and interruption behavior instead of treating them as magic.
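The one-environment-variable swap mentioned above can be sketched roughly like this. The variable name, base URLs, and model ids below are illustrative assumptions, not the course's actual defaults:

```python
import os

# Hypothetical presets: (OpenAI-compatible base URL, model id) per provider.
MODEL_PRESETS = {
    "openrouter": ("https://openrouter.ai/api/v1", "openrouter/auto"),
    "openai": ("https://api.openai.com/v1", "gpt-4o-mini"),
}

def resolve_llm(env=None):
    """Pick the LLM endpoint from one env var, defaulting to OpenRouter."""
    env = os.environ if env is None else env
    choice = env.get("VOICE_AGENT_LLM", "openrouter")
    return MODEL_PRESETS[choice]

# Swapping providers is just a different env var value at boot.
print(resolve_llm({}))                            # default: OpenRouter
print(resolve_llm({"VOICE_AGENT_LLM": "openai"}))  # one-variable swap
```

Because every preset exposes an OpenAI-compatible endpoint, the rest of the agent loop never changes when you swap, which is the same idea behind the mock STT fallback that lets the server boot with zero keys.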
Pricing
Subscribe to Pro for every paid course, or buy just this one.
Unlock this course and every paid course plus workshop replays. One subscription.
You save 54% with regional pricing
One-time purchase. Lifetime access to every lesson, exercise, and update.
You save 47% with regional pricing
Still deciding? Ask Param a question
Real-time phone agents with FastRTC
$79 one-time