Do I need a telephony provider to build voice agents?

Not to start. LiveKit and FastRTC let you build and test voice agents entirely in the browser with WebRTC. Adding a phone number (Twilio or Telnyx) comes later. The real-time phone agents course shows that integration explicitly.

Whisper or Deepgram for speech recognition?

Whisper is the open-source baseline: it runs locally with no per-minute cost, but it is less accurate on noisy audio. Deepgram is faster and more accurate on real-world audio but adds a provider dependency. Production voice agents usually pick Deepgram; indie projects often stick with Whisper.

How do I stop the agent from talking over the user?

Turn detection. Each framework ships a voice activity detector that pauses TTS the moment it hears speech. The LiveKit courses cover the specific knobs (silence thresholds, interruption handling) that make conversations feel natural instead of rude.
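To make the "knobs" concrete, here is a minimal sketch of energy-based turn detection with interruption handling. It is not LiveKit's or FastRTC's implementation; the class name `TurnDetector`, the threshold values, and the 20 ms frame size are all assumptions chosen for illustration. Real frameworks typically use a trained voice activity model rather than raw energy.

```python
import math
from dataclasses import dataclass


def rms(frame):
    """Root-mean-square energy of one frame of float samples in [-1, 1]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


@dataclass
class TurnDetector:
    # Hypothetical knobs; the values are illustrative, not recommendations.
    speech_threshold: float = 0.02   # RMS at or above this counts as speech
    silence_frames_to_end: int = 25  # e.g. 25 frames x 20 ms = 500 ms of silence
    user_speaking: bool = False
    _silent_run: int = 0

    def process(self, frame, agent_speaking):
        """Return 'interrupt', 'end_of_turn', or None for one audio frame."""
        if rms(frame) >= self.speech_threshold:
            self._silent_run = 0
            started = not self.user_speaking
            self.user_speaking = True
            # The user started talking while TTS is playing: pause the agent.
            if started and agent_speaking:
                return "interrupt"
            return None
        if self.user_speaking:
            self._silent_run += 1
            # Enough consecutive silence: the user's turn is over.
            if self._silent_run >= self.silence_frames_to_end:
                self.user_speaking = False
                self._silent_run = 0
                return "end_of_turn"
        return None
```

A caller would feed each incoming microphone frame to `process`, stop TTS playback on `"interrupt"`, and hand the buffered audio to STT on `"end_of_turn"`. Raising `silence_frames_to_end` makes the agent more patient; lowering it makes replies snappier but more likely to cut the user off.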

Can one voice agent hand off to another mid-call?

Yes, and this is where voice AI gets interesting. The multi-agent voice systems course walks through a triage-to-specialist handoff where context and audio session stay continuous. The user never hears the seam.
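As a rough illustration of what "context and audio session stay continuous" can mean, the sketch below keeps one session object for the whole call and changes only which agent is active. Everything here is hypothetical: `CallSession`, `route`, the keyword lists, and the agent names are invented for the example and are not taken from the course or from LiveKit's API.

```python
from dataclasses import dataclass, field


@dataclass
class CallSession:
    """One object per call; it survives handoffs, so history is never lost."""
    session_id: str
    history: list = field(default_factory=list)
    active_agent: str = "triage"


# Hypothetical keyword routing; a real triage agent would likely use an LLM.
SPECIALISTS = {
    "billing": ("invoice", "refund", "charge"),
    "support": ("error", "crash", "broken"),
}


def route(utterance):
    """Pick a specialist for the utterance, or stay with triage."""
    text = utterance.lower()
    for agent, keywords in SPECIALISTS.items():
        if any(k in text for k in keywords):
            return agent
    return "triage"


def handle_turn(session, utterance):
    """Record the turn and hand off from triage to a specialist if needed."""
    session.history.append(("user", utterance))
    target = route(utterance)
    # Only triage hands off; the session (and its history) stays the same.
    if session.active_agent == "triage" and target != "triage":
        session.active_agent = target
    return session.active_agent
```

Because the specialist reads the same `history` list that triage wrote to, it can pick up the conversation without asking the user to repeat anything.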

What is the realistic latency budget?

Aim for under 800 ms end-to-end for a natural feel. That total covers STT, the LLM, TTS, and network time. Streaming helps a lot on the LLM side, and keeping the graph shallow (few sequential steps per turn) keeps total time predictable.
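The budget arithmetic can be sketched in a few lines. The per-stage numbers below are made-up placeholders, not measurements; only the 800 ms target comes from the text above. The stage names (`llm_first_token`, `tts_first_audio`) are assumptions that reflect a streaming pipeline, where what matters is time to first output rather than time to full completion.

```python
BUDGET_MS = 800  # end-to-end target from the text above


def report(stages, budget_ms=BUDGET_MS):
    """Sum per-stage latencies (ms) and compare the total to the budget."""
    total = sum(stages.values())
    return {
        "total_ms": total,
        "headroom_ms": budget_ms - total,
        "within_budget": total < budget_ms,
    }


# Illustrative numbers only; measure your own pipeline.
streaming = {
    "stt": 200,
    "llm_first_token": 300,
    "tts_first_audio": 150,
    "network": 100,
}

# Same pipeline, but waiting for the full LLM response before speaking.
non_streaming = dict(streaming, llm_first_token=1200)

print(report(streaming))
print(report(non_streaming))
```

With these placeholder values the streaming pipeline totals 750 ms and fits the budget, while waiting for the full LLM response pushes the total to 1650 ms.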


Skill track

Voice AI courses

Voice AI sits at the intersection of three things that are each hard alone: real-time audio, speech recognition, and agent logic that responds in under a second. When any one of those drifts, the whole experience feels broken. The engineering is latency budgets, turn detection, context preservation across handoffs, and knowing when to interrupt.

Curated by Param Harrison

These courses cover the full stack most voice AI teams use: LiveKit and FastRTC for transport, Whisper and Deepgram for STT, TTS providers for the response voice, and agent frameworks that tie it together. You build things that sound like real products, not demo reels, with specific patterns for multi-agent voice triage and phone-line integration.


Building voice AI agents with LiveKit and Deepgram

Build a real-time voice AI agent with LiveKit, Deepgram, and tool calling.

Intermediate · Pro

Multi-agent voice systems with LiveKit

Orchestrate specialized voice agents that hand off conversations without losing context.

Advanced · Pro

Local voice transcription with Whisper and LLM post-processing

Record audio in the browser, transcribe locally with Whisper, and clean output with an LLM pipeline.

Intermediate · Pro

Real-time phone agents with FastRTC

Ship a real-time phone voice agent over WebRTC with FastRTC, Whisper, and swappable TTS.

Advanced · Pro

