Your laptop demo is not a deploy. Take a working RAG API, parallelize embedding with Ray, persist vectors in ChromaDB, containerize the service, and roll it out to Kubernetes with probes, ConfigMap, Secret, Ingress, and an HPA. You will watch replicas scale during a load test.
Message a mentor about fit, prerequisites, or where to start. Replies come on WhatsApp, usually within a day.
Take a single-process RAG service and turn it into a horizontally scalable Kubernetes deployment. Parallelize embedding with Ray actors, persist vectors in ChromaDB, package the API in a multi-stage Dockerfile, and roll it out with probes, ConfigMap, Secret, Ingress, and an HPA.
Ship a RAG service that survives real production traffic on Kubernetes.
What you'll ship
What you'll learn
Curriculum
Shape the API
Stand up the FastAPI router with ingest, query, health, and stats endpoints, and pin the request/response contracts with Pydantic
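The course pins these contracts with Pydantic models; as a rough sketch of plausible request/response shapes (field names and defaults here are assumptions, shown with stdlib dataclasses rather than Pydantic):

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical contract shapes for the ingest and query endpoints.
@dataclass
class IngestRequest:
    documents: List[str]
    batch_size: int = 32          # how many docs each embedding batch holds

@dataclass
class QueryRequest:
    question: str
    top_k: int = 5                # number of context chunks to retrieve

@dataclass
class QueryResponse:
    answer: str
    sources: List[str] = field(default_factory=list)  # retrieved chunk ids
```

In the real service these would be `pydantic.BaseModel` subclasses so FastAPI validates payloads and rejects malformed requests at the edge.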
Single-process embedding
Wire SentenceTransformer and ChromaDB end to end so ingest and query work before we add Ray
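The control flow of that single-process path can be sketched with stdlib stand-ins. Below, `toy_embed` plays the role of `SentenceTransformer.encode` and `ToyIndex` the role of a ChromaDB collection; both are deliberately toy, not the course's implementation:

```python
import math
from typing import Dict, List

def toy_embed(text: str) -> List[float]:
    # Stand-in for SentenceTransformer.encode: a normalized
    # bag-of-character-codes vector, deterministic and dependency-free.
    vec = [0.0] * 8
    for i, ch in enumerate(text.lower()):
        vec[i % 8] += ord(ch)
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

class ToyIndex:
    """Stand-in for a ChromaDB collection: add(), then query() by cosine."""

    def __init__(self) -> None:
        self.docs: Dict[str, List[float]] = {}

    def add(self, texts: List[str]) -> None:
        for t in texts:
            self.docs[t] = toy_embed(t)

    def query(self, question: str, top_k: int = 3) -> List[str]:
        q = toy_embed(question)
        # Rank stored docs by cosine similarity (vectors are unit-norm,
        # so the dot product is the cosine).
        ranked = sorted(
            self.docs,
            key=lambda t: -sum(a * b for a, b in zip(q, self.docs[t])),
        )
        return ranked[:top_k]
```

Swapping `toy_embed` for a real model and `ToyIndex` for a Chroma collection changes nothing about the ingest-then-query shape, which is why the lesson wires this path first.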
Parallel embedding with Ray
Fan embedding work across Ray actors and measure the speedup against the sequential baseline
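The fan-out/gather shape is the core of this lesson. In the course each worker is a Ray actor holding its own model replica; the sketch below uses a stdlib `ThreadPoolExecutor` as a stand-in to show the same batching and reassembly (Ray actors give true multi-process parallelism, which threads do not):

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

def chunk(items: List[str], n_parts: int) -> List[List[str]]:
    """Split items into roughly equal batches, one per worker."""
    size = max(1, -(-len(items) // n_parts))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]

def fan_out_embed(
    texts: List[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    workers: int = 4,
) -> List[List[float]]:
    """Embed batches in parallel, then restitch results in input order.
    With Ray, each embed_batch call would be actor.embed.remote(batch)
    and the gather would be ray.get(); the shape is identical."""
    batches = chunk(texts, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(embed_batch, batches))  # order-preserving
    return [vec for batch in results for vec in batch]
```

Measuring this against the sequential baseline is what the lesson's benchmark does; the speedup comes from each actor loading the model once and embedding its batch concurrently.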
Grounded answers
Retrieve top-k context from ChromaDB and synthesize answers with a strict context-only prompt
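One way to enforce "context-only" is at the prompt layer. A minimal sketch, assuming a hypothetical `build_grounded_prompt` helper; the template wording is an assumption, not the course's exact prompt:

```python
from typing import List

CONTEXT_ONLY_TEMPLATE = """Answer using ONLY the context below.
If the answer is not in the context, say "I don't know."

Context:
{context}

Question: {question}
Answer:"""

def build_grounded_prompt(question: str, chunks: List[str]) -> str:
    # Number the retrieved chunks so the model can cite its sources.
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return CONTEXT_ONLY_TEMPLATE.format(context=context, question=question)
```

The top-k chunks come straight from the ChromaDB query; the strict template is what keeps the LLM from answering outside the retrieved context.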
Containerize the service
Build a multi-stage Docker image that caches the embedding model and runs as a non-root user
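A plausible shape for such an image; the model name, module path `app.main:app`, cache location, and port are assumptions, not the course's exact Dockerfile:

```dockerfile
# Stage 1: install deps and pre-download the embedding model
FROM python:3.11-slim AS build
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Bake the model weights into the image so pods don't download on cold start
RUN python -c "from sentence_transformers import SentenceTransformer; \
    SentenceTransformer('all-MiniLM-L6-v2')"

# Stage 2: slim runtime image that runs as a non-root user
FROM python:3.11-slim
RUN useradd --create-home appuser
COPY --from=build /usr/local/lib/python3.11/site-packages /usr/local/lib/python3.11/site-packages
COPY --from=build /usr/local/bin /usr/local/bin
COPY --from=build /root/.cache /home/appuser/.cache
COPY . /app
WORKDIR /app
RUN chown -R appuser:appuser /app /home/appuser/.cache
USER appuser
EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
```

Caching the model in the build stage is what keeps pod startup fast enough for the readiness probe windows used later in the deploy lesson.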
Deploy to Kubernetes
Apply Deployment, Service, and Ingress manifests with readiness and liveness probes
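A minimal sketch of the Deployment half of those manifests; the names, port, probe timings, and `/health` path are assumptions:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rag-api
spec:
  replicas: 2
  selector:
    matchLabels: { app: rag-api }
  template:
    metadata:
      labels: { app: rag-api }
    spec:
      containers:
        - name: rag-api
          image: rag-api:latest
          ports:
            - containerPort: 8000
          readinessProbe:            # gate traffic until model and index load
            httpGet: { path: /health, port: 8000 }
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:             # restart the pod if the process wedges
            httpGet: { path: /health, port: 8000 }
            initialDelaySeconds: 30
            periodSeconds: 10
```

The Service selects `app: rag-api` and the Ingress routes to that Service; the readiness probe is what stops Kubernetes from sending queries to a pod still loading the embedding model.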
Secrets and ConfigMap
Split non-secret configuration into a ConfigMap and the LLM API key into a Secret so key rotation is safe
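A sketch of that split; the object names and keys are assumptions:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: rag-config            # hypothetical name
data:
  EMBED_MODEL: all-MiniLM-L6-v2
  TOP_K: "5"
---
apiVersion: v1
kind: Secret
metadata:
  name: rag-secrets           # hypothetical name
type: Opaque
stringData:
  LLM_API_KEY: replace-me     # rotate by updating only this object
```

The container then pulls both via `envFrom` with a `configMapRef` and a `secretRef`, so rotating the key touches neither the ConfigMap, the manifest, nor the image.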
Horizontal scale and observability
Wire an HPA, run a load test against ingest, and add observability hooks so you can see the cluster scale
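A CPU-based HPA targeting the Deployment might look like this; the utilization threshold and replica bounds are assumptions:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: rag-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: rag-api
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when average CPU passes 70%
```

During the load test against ingest, embedding drives CPU past the target and you watch `kubectl get hpa -w` add replicas; this requires the metrics-server and CPU requests set on the container.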
Who it's for
Engineers whose demos melt the moment they hit real traffic because embedding is single-threaded and the index lives in memory
Developers who know Docker but have never wired probes, ConfigMap, Secret, Ingress, and an HPA for an ML workload
Teams who need to operate a grounded retrieval service that scales horizontally and survives a pod restart
FAQ
No. The workshop runs on any local cluster, including kind, minikube, k3d, or Docker Desktop. Every manifest works the same way on GKE, EKS, or AKS later.
No. Ray is lazy-imported. If Ray is missing, the embedder falls back to a single-process SentenceTransformer with the same interface. You will build both paths and compare them.
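That lazy-import fallback can be sketched as follows; `pick_embed_backend` and its signature are hypothetical, but the pattern (probe for Ray, otherwise return the same interface single-process) matches the answer above:

```python
import importlib.util
from typing import Callable, List, Tuple

Embed = Callable[[List[str]], List[List[float]]]

def pick_embed_backend(local_embed: Embed) -> Tuple[str, Embed]:
    """Return (backend_name, embed_fn); Ray is used only when importable."""
    if importlib.util.find_spec("ray") is None:
        # Ray absent: same interface, single-process execution.
        return "local", local_embed
    import ray  # lazy import: only paid for when Ray is installed
    if not ray.is_initialized():
        ray.init(ignore_reinit_error=True)
    remote_embed = ray.remote(local_embed)

    def embed_fn(texts: List[str]) -> List[List[float]]:
        return ray.get(remote_embed.remote(texts))

    return "ray", embed_fn
```

Because both branches return a function with the same signature, the rest of the service never needs to know which path it is running on.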
ChromaDB with a persistent client. It runs inside the pod and stores vectors on a mounted volume. The same code works against a managed vector store by swapping the index layer.
The course uses the OpenRouter provider by default and the code supports Fireworks, Gemini, and OpenAI through an env switch. You only need one key to follow along.
Pricing
Subscribe to Pro for every paid course, or buy just this one.
Unlock this course and every paid course plus workshop replays. One subscription.
You save 54% with regional pricing
One-time purchase. Lifetime access to every lesson, exercise, and update.
You save 47% with regional pricing
Still deciding? Ask Param a question
Enterprise RAG infrastructure with Kubernetes and Ray
$79 one-time