Most teams stay on managed APIs. Some discover their workload is too high-volume, too cost-sensitive, or too privacy-bound to keep paying per token. This course is the production path to serving it yourself: quantize the model, optimize inference, deploy on Kubernetes, and prove it survives a 100K-query stress test.
Message a mentor about fit, prerequisites, or where to start. Replies come on WhatsApp, usually within a day.
Self-host an LLM that holds up under a 100K-query stress test. Quantize the weights with W4A16 and W8A8, speed up inference with KV caching and speculative decoding, wrap it in FastAPI, containerize it with Docker, and deploy on a GCP-hosted Kubernetes cluster. A guided curriculum that walks the production path from a baseline 3B model to a stress-tested service.
When the API bill outgrows the case for a managed provider, you self-host. Here is the production path that actually scales.
Curriculum
The serving cost problem
Why teams self-host, what a baseline 3B model actually costs in latency and memory, where each optimization fits.
Weight quantization
W4A16, W8A8, W4A8 trade-offs measured on real evaluation data. Pick the right precision for your accuracy budget (see the loading sketch after this list).
Inference optimization
KV cache, paged attention, prompt-lookup decoding, speculative decoding. Each one cuts latency, especially on long inputs (see the optimization sketch after this list).
Production deployment
FastAPI server, Docker image, Kubernetes deployment plus HPA, GCP cluster setup, and a 100K-query Locust stress test (see the serving sketch after this list).
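To make the modules concrete, here are three minimal sketches. First, loading a quantized checkpoint with vLLM. The model ID is a placeholder, and the quantization argument depends on how the checkpoint was produced (AWQ, GPTQ, or llm-compressor's compressed-tensors format); vLLM can usually infer it from the checkpoint config.

```python
# Minimal sketch: serve a W4A16-quantized 3B checkpoint with vLLM.
# "your-org/llama-3.2-3b-w4a16" is a hypothetical model ID; substitute
# any quantized checkpoint you have.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/llama-3.2-3b-w4a16",   # hypothetical quantized checkpoint
    quantization="compressed-tensors",      # scheme used by llm-compressor exports
    max_model_len=4096,
)

params = SamplingParams(temperature=0.7, max_tokens=128)
out = llm.generate(["Why quantize a 3B model?"], params)
print(out[0].outputs[0].text)
```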
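Second, the inference-optimization flags. Engine-argument names have shifted across vLLM releases, so treat this as a sketch of the pre-1.0 interface rather than copy-paste config.

```python
# Sketch: prefix caching plus prompt-lookup (n-gram) speculative decoding.
# Argument names follow older vLLM engine args and may differ in your version.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.2-3B-Instruct",
    enable_prefix_caching=True,      # reuse KV cache across shared prompt prefixes
    speculative_model="[ngram]",     # draft tokens by matching n-grams in the prompt
    num_speculative_tokens=5,        # draft tokens proposed per decoding step
    ngram_prompt_lookup_max=4,       # longest n-gram to match against the prompt
)
```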
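Third, the serving layer. Production setups put vLLM's async engine (or its built-in OpenAI-compatible server) behind the app; this toy version only shows the request/response shape the course builds on.

```python
# Toy sketch: a FastAPI endpoint in front of a vLLM engine.
# Blocking generate() is fine for a demo; production uses the async engine.
from fastapi import FastAPI
from pydantic import BaseModel
from vllm import LLM, SamplingParams

app = FastAPI()
llm = LLM(model="your-org/llama-3.2-3b-w4a16")  # hypothetical checkpoint

class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 256

@app.post("/generate")
def generate(req: GenerateRequest):
    params = SamplingParams(max_tokens=req.max_tokens)
    result = llm.generate([req.prompt], params)[0]
    return {"text": result.outputs[0].text}
```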
Who it's for
The finance team wants to know why your inference cost grew 3x this quarter. You know self-hosting is cheaper at your volume; you just have not figured out the production path.
Your data cannot leave the VPC, or your latency budget cannot tolerate the network hop to a managed provider. You need self-hosting that actually scales.
Multiple teams want to deploy fine-tuned models. You need a reference architecture for FastAPI + GPU + Kubernetes that just works.
FAQ
Do I need 100K-query scale for this to be worth it?
No. The patterns scale up and down. Run the same setup at 100 queries per second on a single GPU, or scale out to many replicas with HPA. The 100K test exists to prove the architecture, not as a requirement.
Can I deploy on AWS or Azure instead of GCP?
Yes. The Docker image and Kubernetes manifests are cloud-agnostic. The GCP-specific section is one short module on cluster creation. EKS or AKS need slightly different cluster commands, but the rest is identical.
What GPU do I need?
For the labs, a single A10G or L4 GPU is enough for a 3B model after W4A16 quantization. Production multi-GPU is covered in the deployment module but not required to follow along.
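A back-of-envelope check on why one 24 GB card suffices, counting weight memory only (KV cache and activations come on top):

```python
# Rough weight-memory math for a 3B-parameter model. Approximations, not
# measurements; at W4A16 a 24 GB A10G or L4 leaves ample room for KV cache.
params = 3e9
bytes_per_param = {"fp16": 2.0, "w8a8 (int8)": 1.0, "w4a16 (int4)": 0.5}
for scheme, b in bytes_per_param.items():
    print(f"{scheme}: ~{params * b / 1e9:.1f} GB of weights")
# fp16: ~6.0 GB, w8a8: ~3.0 GB, w4a16: ~1.5 GB
```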
How is this different from the Enterprise RAG course?
Enterprise RAG focuses on scaling RAG pipelines with Ray. This course focuses on serving the model itself: quantization, KV cache, FastAPI, K8s. The two are complementary; this is the model-serving layer that sits underneath the RAG layer.
Pricing
One subscription unlocks every paid course and workshop replay. Pick yearly or monthly.
Unlock with Pro
You save 47% with regional pricing
Billed annually. Cancel anytime.
Still deciding? Ask Param a question
Self-hosting is not magic. It is a stack of optimizations, each one quantified.
Serving LLMs at scale with vLLM, quantization, and Kubernetes
From $16/mo with Pro