Your AI service image is 2.8 GB and takes 12 minutes to pull

You Dockerized your agent service the way every tutorial shows: FROM python:3.12, COPY . ., RUN pip install -r requirements.txt, CMD python main.py. It works. Then you push to a registry and your deploy takes 12 minutes because the image is 2.8 GB. Every engineer pulling it waits. Every rolling restart in production is slow. Every autoscaling event is slow.

The problem is not Docker. The problem is that the naive Dockerfile was never designed for production. AI service images carry torch, transformers, numpy, pandas, langchain, and the full transitive closure of every ML dependency. Without discipline, the image is mostly dead weight: build tools, test dependencies, source files that the runtime never reads.

This post is the layered image pattern I ship on every production AI service: multi-stage build, minimal runtime, cached dependency layer, and the seven specific Dockerfile decisions that cut image size from 2.8 GB to 450 MB without giving up any functionality.

Why are naive AI service images so big?

Because Python ML dependencies are huge and a single-stage Dockerfile keeps everything. Even the slim python:3.12 base image starts at about 150 MB. Add torch and you are at 1.2 GB. Add transformers, sentence-transformers, numpy, pandas, and scikit-learn and you are past 2 GB. Then the naive Dockerfile keeps:

  • Build-time tools: gcc, g++, make, Python headers. Needed to compile wheels that have no prebuilt binaries, then useless at runtime.
  • Source code including tests: tests/, scripts/, fixture data. Runtime never reads them.
  • Package caches: pip's download cache, apt's lists. Kept unless explicitly cleaned.
  • Dev dependencies: pytest, ruff, mypy. Never run in production.

Every one of these bloats the image, slows pulls, and extends every deploy. Multi-stage builds let you keep all of them in a builder stage that never ships.

graph LR
    Stage1[Builder stage<br/>python:3.12-slim + build-essential]
    Stage2[Runtime stage<br/>python:3.12-slim, minimal]

    Stage1 -->|uv sync --frozen --no-dev| Venv[Virtual environment with installed packages]
    Venv -->|COPY --from=builder| Stage2

    Stage2 --> Image[Final image 450 MB]

    style Stage1 fill:#fef3c7,stroke:#b45309
    style Stage2 fill:#dbeafe,stroke:#1e40af
    style Image fill:#dcfce7,stroke:#15803d

The builder stage has all the heavy tools. The runtime stage inherits only the compiled artifacts.

What does a production Dockerfile look like for an AI service?

# syntax=docker/dockerfile:1.6
# filename: Dockerfile
# description: Multi-stage build for a FastAPI AI service.
# Slim runtime, cached dependencies, non-root user.

# ---- Builder stage ----
FROM python:3.12-slim AS builder

RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv

WORKDIR /build

# Create the venv at its final path so the scripts' shebangs
# still resolve after the COPY into the runtime stage.
ENV UV_PROJECT_ENVIRONMENT=/opt/venv

COPY pyproject.toml uv.lock ./
RUN --mount=type=cache,target=/root/.cache/uv \
    uv sync --frozen --no-dev --no-install-project

# ---- Runtime stage ----
FROM python:3.12-slim AS runtime

RUN apt-get update && apt-get install -y --no-install-recommends \
    curl \
    && rm -rf /var/lib/apt/lists/* \
    && adduser --disabled-password --gecos "" --no-create-home agent

COPY --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

WORKDIR /app
COPY --chown=agent:agent ./app ./app
COPY --chown=agent:agent pyproject.toml ./

USER agent

EXPOSE 8000

HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
    CMD curl --fail http://localhost:8000/health || exit 1

CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

Seven decisions in this file are doing the work:

  1. python:3.12-slim as the base: roughly 150 MB vs 900 MB for the full image.
  2. The builder stage has build-essential and uv; the runtime has neither.
  3. The --mount=type=cache on uv sync keeps the dependency cache warm across builds.
  4. --no-dev skips pytest, ruff, mypy, and everything else in the dev group.
  5. The runtime stage copies only the virtual environment from the builder, not the build context.
  6. A non-root agent user is created and USER agent switches to it.
  7. The HEALTHCHECK lets the orchestrator detect dead containers.

For the bigger-picture security patterns including the non-root user rationale, see the Docker non-root user for agentic AI security post.

How does the layer caching save build time?

Docker caches each COPY and RUN instruction by the hash of its inputs. The naive Dockerfile puts COPY . . before pip install, which means any source file change invalidates the dependency layer. The layered approach copies pyproject.toml and uv.lock first, runs uv sync, then copies source code. A code-only change skips the dependency install entirely.

graph LR
    A[COPY pyproject.toml uv.lock] --> B[uv sync --frozen]
    B --> C[COPY ./app]
    C --> D[Built image]

    Change[Code change<br/>invalidates only C and D] --> C

    style B fill:#dcfce7,stroke:#15803d
    style C fill:#fef3c7,stroke:#b45309

The payoff: a source-only rebuild drops from 8 minutes to 30 seconds. Multiply that by your engineers and CI runs per day and the time savings compound quickly.
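To isolate the ordering that produces this payoff, here is the naive layout next to the cache-friendly one (paths and filenames are assumed to match the Dockerfile above):

```dockerfile
# Naive ordering: any source edit invalidates the install layer below it.
#   COPY . .
#   RUN pip install -r requirements.txt

# Cache-friendly ordering: the install layer depends only on the lockfile.
COPY pyproject.toml uv.lock ./
RUN uv sync --frozen --no-dev
COPY ./app ./app
```

Docker walks the instructions top to bottom and reuses every cached layer until the first one whose inputs changed, so keeping the volatile COPY last protects everything above it.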

For the full layer caching deep-dive with cache mounts and BuildKit, see the Docker layer caching for faster agent builds post.

What should NOT go in the runtime stage?

The runtime stage is only what the service reads at request time. These belong in the builder stage or not in the image at all:

  1. Build toolchain. gcc, g++, make, python3-dev. The runtime binary does not need them.
  2. Test dependencies. pytest, ruff, mypy, coverage tools. These run in CI, not in production.
  3. Source map generation or asset compilation tools. Already compiled in the builder stage.
  4. Package caches. pip's cache, apt's lists. Remove with rm -rf /var/lib/apt/lists/*.
  5. Documentation. /usr/share/doc, /usr/share/man. Not needed by any production service.
  6. Test data and fixtures. Separate these from production code or add a .dockerignore that excludes them.
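A .dockerignore covering the list above might look like this; the entries are assumptions about a typical project layout, so match them to yours:

```text
# .dockerignore: keep the build context lean
.git/
.venv/
__pycache__/
*.pyc
tests/
scripts/
fixtures/
.pytest_cache/
.mypy_cache/
.ruff_cache/
```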

The rule: after building the runtime image, run docker history on it, look at each layer, and ask "is every byte here doing real work?" If the answer is no, something needs to move out.

How do you measure the image efficiency?

4 metrics to track on every push:

  1. Image size (docker image ls). Target: under 500 MB for a standard AI service.
  2. Layer count (docker history). Target: under 15 layers. Each extra layer is overhead.
  3. Pull time from registry. Target: under 30 seconds from a warm CDN.
  4. Cold start time (container spawn to first request served). Target: under 10 seconds.

Run these as part of CI. If any metric regresses by more than 20 percent, fail the build. That catches accidental bloat before it reaches production.
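As a sketch of the CI gate for the first metric, the budget check can live in a small helper; the image tag and the 500 MB budget are assumptions, and the docker invocation is shown as a comment so the helper itself stays portable:

```shell
#!/usr/bin/env sh
# Fail the build when the image exceeds its size budget.

# check_size_mb SIZE_MB MAX_MB -> exit 0 if within budget, nonzero otherwise
check_size_mb() {
    [ "$1" -le "$2" ]
}

# In CI, feed it the real size (hypothetical tag my-agent:latest):
#   bytes=$(docker image inspect my-agent:latest --format '{{.Size}}')
#   check_size_mb $((bytes / 1024 / 1024)) 500 || { echo "image over budget" >&2; exit 1; }
```

The same shape works for the other metrics: measure, compare against a committed baseline, and exit nonzero on a regression past the threshold.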

For the broader infrastructure picture that sits alongside the image build, the Build your own coding agent course walks through deploying a real agent. The free AI Agents Fundamentals primer is the right starting point if the agent loop is still new.

What to do Monday morning

  1. Open your current Dockerfile. Run docker image ls and note the size. If it is over 1 GB, the layered pattern is worth the effort.
  2. Split into builder and runtime stages. Put build-essential and any toolchain in the builder. Runtime starts from python:3.12-slim and copies only the compiled venv.
  3. Switch to uv sync from pip install if you haven't already. Add the --mount=type=cache flag for persistent dep caching across builds.
  4. Add a non-root agent user and switch to it with USER agent. Set ownership on the copied source with --chown.
  5. Add .dockerignore to exclude tests/, scripts/, .git/, local virtualenvs, and any data that production does not need.
  6. Measure the before and after. Target a 60-70 percent size reduction on a realistic AI service.

The headline: a production AI image is a minimal runtime with a heavy builder that never ships. Seven Dockerfile decisions cut the size by more than 80 percent, the pull time by even more, and the cold start time into the 10-second range.

Frequently asked questions

Why should AI services use multi-stage Docker builds?

Because AI dependencies are huge (torch, transformers, numpy) and many of them require a build toolchain (gcc, python3-dev) that you only need at install time, not at runtime. A single-stage image keeps the toolchain and bloats by hundreds of megabytes. A multi-stage build compiles wheels in a builder stage and copies only the final binaries into a slim runtime image, cutting image size by 60-70 percent without losing functionality.

How much smaller is a layered AI image vs a naive one?

Typical numbers: a naive python:3.12 + torch + transformers + langchain image is 2.5-2.8 GB. The same dependencies in a layered build on python:3.12-slim with a multi-stage Dockerfile lands at 400-500 MB. Pull time drops from several minutes to under 30 seconds from a warm CDN. Cold start drops into the 10-second range because there is less to load into memory.

What is the single biggest Docker layer cache win?

Copying pyproject.toml and the lockfile BEFORE copying any source code, then running uv sync as a dedicated step. This makes the dependency layer cacheable as long as the lockfile hasn't changed. A source-only code change rebuilds in 30 seconds instead of 8 minutes because the cached dependency layer is reused.

Should I use python:3.12-slim or a distroless image?

Start with python:3.12-slim. It is 150 MB, has a working shell for debugging, and is the right baseline for 95 percent of AI services. Distroless (gcr.io/distroless/python3) is 50 MB smaller but removes the shell, which complicates debugging. Move to distroless only when you are sure the observability story is already solid and you need the last bit of size reduction.

How do I handle secrets at build time without baking them into the image?

Use Docker BuildKit's --mount=type=secret flag to pass secrets during pip install or private-repo clones without them being stored in any layer. Or better, avoid build-time secrets entirely by pulling from public package mirrors and letting the runtime container read secrets from the environment. Baking keys into an image is a security anti-pattern even when the image is private.
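A sketch of the secret mount; the secret id, source file, and private index URL are all assumptions:

```dockerfile
# syntax=docker/dockerfile:1.6
# Build with: docker build --secret id=pip_token,src=./pip_token.txt .
FROM python:3.12-slim AS builder
COPY requirements.txt ./
RUN --mount=type=secret,id=pip_token \
    PIP_INDEX_URL="https://token:$(cat /run/secrets/pip_token)@pypi.example.internal/simple" \
    pip install -r requirements.txt
```

The secret is mounted at /run/secrets/pip_token only for the duration of that RUN, so it never lands in a layer or in the image history.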

Key takeaways

  1. Naive AI service images are 2-3 GB because they keep build toolchains, test deps, caches, and source bloat that the runtime never reads.
  2. Multi-stage builds keep heavy tools in a builder stage and ship a minimal runtime. Typical cut: 2.8 GB to 450 MB, a 6x reduction.
  3. Copy dependency manifests before source code, then run uv sync with a BuildKit cache mount. This makes the dependency layer cacheable across code changes.
  4. Runtime stage = python:3.12-slim + compiled venv + source + non-root user. Nothing else belongs there.
  5. Measure image size, layer count, pull time, and cold start in CI. Fail the build on significant regressions to prevent accidental bloat.
  6. To see this Docker pattern wired into a full production agent stack with auth, tools, and observability, walk through the Build your own coding agent course, or start with the AI Agents Fundamentals primer.

For the full Docker multi-stage build documentation and advanced patterns like target selection and shared build contexts, see the Docker multi-stage builds guide. The official reference covers every flag in the Dockerfile above.
