Your laptop demo is not a deploy. Take a working RAG API, parallelize embedding with Ray, persist vectors in ChromaDB, containerize the service, and roll it out to Kubernetes with probes, ConfigMap, Secret, Ingress, and an HPA. You will watch replicas scale during a load test.
Message a mentor about fit, prerequisites, or where to start. Replies come on WhatsApp, usually within a day.
Take a single-process RAG service and turn it into a horizontally scalable Kubernetes deployment. Parallelize embedding with Ray actors, persist vectors in ChromaDB, package the API in a multi-stage Dockerfile, and roll it out with probes, ConfigMap, Secret, Ingress, and an HPA.
Ship a RAG service that survives real production traffic on Kubernetes.
What you'll ship
What you'll learn
Curriculum
Shape the API
Stand up the FastAPI router with ingest, query, health, and stats endpoints, and pin the request/response contracts with Pydantic
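The course pins these contracts with Pydantic models; as a rough sketch of plausible request/response shapes (field names and defaults here are assumptions, shown with stdlib dataclasses rather than Pydantic):

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical contract shapes for the ingest and query endpoints.
@dataclass
class IngestRequest:
    documents: List[str]
    batch_size: int = 32          # how many docs each embedding batch holds

@dataclass
class QueryRequest:
    question: str
    top_k: int = 5                # number of context chunks to retrieve

@dataclass
class QueryResponse:
    answer: str
    sources: List[str] = field(default_factory=list)  # retrieved chunk ids
```

In the real service these would be `pydantic.BaseModel` subclasses so FastAPI validates payloads and rejects malformed requests at the edge.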
Single-process embedding
Wire SentenceTransformer and ChromaDB end to end so ingest and query work before we add Ray
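The control flow of that single-process path can be sketched with stdlib stand-ins. Below, `toy_embed` plays the role of `SentenceTransformer.encode` and `ToyIndex` the role of a ChromaDB collection; both are deliberately toy, not the course's implementation:

```python
import math
from typing import Dict, List

def toy_embed(text: str) -> List[float]:
    # Stand-in for SentenceTransformer.encode: a normalized
    # bag-of-character-codes vector, deterministic and dependency-free.
    vec = [0.0] * 8
    for i, ch in enumerate(text.lower()):
        vec[i % 8] += ord(ch)
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

class ToyIndex:
    """Stand-in for a ChromaDB collection: add(), then query() by cosine."""

    def __init__(self) -> None:
        self.docs: Dict[str, List[float]] = {}

    def add(self, texts: List[str]) -> None:
        for t in texts:
            self.docs[t] = toy_embed(t)

    def query(self, question: str, top_k: int = 3) -> List[str]:
        q = toy_embed(question)
        # Rank stored docs by cosine similarity (vectors are unit-norm,
        # so the dot product is the cosine).
        ranked = sorted(
            self.docs,
            key=lambda t: -sum(a * b for a, b in zip(q, self.docs[t])),
        )
        return ranked[:top_k]
```

Swapping `toy_embed` for a real model and `ToyIndex` for a Chroma collection changes nothing about the ingest-then-query shape, which is why the lesson wires this path first.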
Parallel embedding with Ray
Fan embedding work across Ray actors and measure the speedup against the sequential baseline
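The fan-out/gather shape is the core of this lesson. In the course each worker is a Ray actor holding its own model replica; the sketch below uses a stdlib `ThreadPoolExecutor` as a stand-in to show the same batching and reassembly (Ray actors give true multi-process parallelism, which threads do not):

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

def chunk(items: List[str], n_parts: int) -> List[List[str]]:
    """Split items into roughly equal batches, one per worker."""
    size = max(1, -(-len(items) // n_parts))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]

def fan_out_embed(
    texts: List[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    workers: int = 4,
) -> List[List[float]]:
    """Embed batches in parallel, then restitch results in input order.
    With Ray, each embed_batch call would be actor.embed.remote(batch)
    and the gather would be ray.get(); the shape is identical."""
    batches = chunk(texts, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(embed_batch, batches))  # order-preserving
    return [vec for batch in results for vec in batch]
```

Measuring this against the sequential baseline is what the lesson's benchmark does; the speedup comes from each actor loading the model once and embedding its batch concurrently.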
Grounded answers
Retrieve top-k context from ChromaDB and synthesize answers with a strict context-only prompt
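One way to enforce "context-only" is at the prompt layer. A minimal sketch, assuming a hypothetical `build_grounded_prompt` helper; the template wording is an assumption, not the course's exact prompt:

```python
from typing import List

CONTEXT_ONLY_TEMPLATE = """Answer using ONLY the context below.
If the answer is not in the context, say "I don't know."

Context:
{context}

Question: {question}
Answer:"""

def build_grounded_prompt(question: str, chunks: List[str]) -> str:
    # Number the retrieved chunks so the model can cite its sources.
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return CONTEXT_ONLY_TEMPLATE.format(context=context, question=question)
```

The top-k chunks come straight from the ChromaDB query; the strict template is what keeps the LLM from answering outside the retrieved context.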
Containerize the service
Build a multi-stage Docker image that caches the embedding model and runs as a non-root user
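A plausible shape for such an image; the model name, module path `app.main:app`, cache location, and port are assumptions, not the course's exact Dockerfile:

```dockerfile
# Stage 1: install deps and pre-download the embedding model
FROM python:3.11-slim AS build
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Bake the model weights into the image so pods don't download on cold start
RUN python -c "from sentence_transformers import SentenceTransformer; \
    SentenceTransformer('all-MiniLM-L6-v2')"

# Stage 2: slim runtime image that runs as a non-root user
FROM python:3.11-slim
RUN useradd --create-home appuser
COPY --from=build /usr/local/lib/python3.11/site-packages /usr/local/lib/python3.11/site-packages
COPY --from=build /usr/local/bin /usr/local/bin
COPY --from=build /root/.cache /home/appuser/.cache
COPY . /app
WORKDIR /app
RUN chown -R appuser:appuser /app /home/appuser/.cache
USER appuser
EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
```

Caching the model in the build stage is what keeps pod startup fast enough for the readiness probe windows used later in the deploy lesson.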
Deploy to Kubernetes
Apply Deployment, Service, and Ingress manifests with readiness and liveness probes
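A minimal sketch of the Deployment half of those manifests; the names, port, probe timings, and `/health` path are assumptions:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rag-api
spec:
  replicas: 2
  selector:
    matchLabels: { app: rag-api }
  template:
    metadata:
      labels: { app: rag-api }
    spec:
      containers:
        - name: rag-api
          image: rag-api:latest
          ports:
            - containerPort: 8000
          readinessProbe:            # gate traffic until model and index load
            httpGet: { path: /health, port: 8000 }
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:             # restart the pod if the process wedges
            httpGet: { path: /health, port: 8000 }
            initialDelaySeconds: 30
            periodSeconds: 10
```

The Service selects `app: rag-api` and the Ingress routes to that Service; the readiness probe is what stops Kubernetes from sending queries to a pod still loading the embedding model.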
Secrets and ConfigMap
Split non-secret configuration into a ConfigMap and the LLM API key into a Secret so key rotation is safe
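A sketch of that split; the object names and keys are assumptions:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: rag-config            # hypothetical name
data:
  EMBED_MODEL: all-MiniLM-L6-v2
  TOP_K: "5"
---
apiVersion: v1
kind: Secret
metadata:
  name: rag-secrets           # hypothetical name
type: Opaque
stringData:
  LLM_API_KEY: replace-me     # rotate by updating only this object
```

The container then pulls both via `envFrom` with a `configMapRef` and a `secretRef`, so rotating the key touches neither the ConfigMap, the manifest, nor the image.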
Horizontal scale and observability
Wire an HPA, run a load test against ingest, and add observability hooks so you can see the cluster scale
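A CPU-based HPA targeting the Deployment might look like this; the utilization threshold and replica bounds are assumptions:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: rag-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: rag-api
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when average CPU passes 70%
```

During the load test against ingest, embedding drives CPU past the target and you watch `kubectl get hpa -w` add replicas; this requires the metrics-server and CPU requests set on the container.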
Who it's for
Engineers whose demos melt the moment they hit real traffic because embedding is single-threaded and the index lives in memory
Developers who know Docker but have never wired probes, ConfigMap, Secret, Ingress, and an HPA for an ML workload
Teams who need to operate a grounded retrieval service that scales horizontally and survives a pod restart
FAQ
No. The workshop runs on any local cluster, including kind, minikube, k3d, or Docker Desktop. Every manifest works the same way on GKE, EKS, or AKS later.
No. Ray is lazy-imported. If Ray is missing, the embedder falls back to a single-process SentenceTransformer with the same interface. You will build both paths and compare them.
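That lazy-import fallback can be sketched as follows; `pick_embed_backend` and its signature are hypothetical, but the pattern (probe for Ray, otherwise return the same interface single-process) matches the answer above:

```python
import importlib.util
from typing import Callable, List, Tuple

Embed = Callable[[List[str]], List[List[float]]]

def pick_embed_backend(local_embed: Embed) -> Tuple[str, Embed]:
    """Return (backend_name, embed_fn); Ray is used only when importable."""
    if importlib.util.find_spec("ray") is None:
        # Ray absent: same interface, single-process execution.
        return "local", local_embed
    import ray  # lazy import: only paid for when Ray is installed
    if not ray.is_initialized():
        ray.init(ignore_reinit_error=True)
    remote_embed = ray.remote(local_embed)

    def embed_fn(texts: List[str]) -> List[List[float]]:
        return ray.get(remote_embed.remote(texts))

    return "ray", embed_fn
```

Because both branches return a function with the same signature, the rest of the service never needs to know which path it is running on.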
ChromaDB with a persistent client. It runs inside the pod and stores vectors on a mounted volume. The same code works against a managed vector store by swapping the index layer.
The course uses the OpenRouter provider by default and the code supports Fireworks, Gemini, and OpenAI through an env switch. You only need one key to follow along.
Pricing
Subscribe to Pro for every paid course, or buy just this one.
Unlock this course and every paid course plus workshop replays. One subscription.
You save 54% with regional pricing
One-time purchase. Lifetime access to every lesson, exercise, and update.
You save 47% with regional pricing
Still deciding? Ask Param a question
Enterprise RAG infrastructure with Kubernetes and Ray
$79 one-time