Highlights
- What you'll build: a trained scikit-learn model wrapped in a FastAPI service, containerized with a multi-stage Docker build, and deployed to Kubernetes with health checks and autoscaling. Laptop-only, no cloud account required.
- Time required: about 60 minutes if you code along.
- Prerequisites: Python 3.11+, Docker Engine 26.x, kubectl, and a local cluster (kind or minikube).
- Stack: FastAPI for serving, Docker multi-stage builds for packaging, Kubernetes Deployment + Service + HPA for orchestration, and KServe, Seldon Core, or BentoML for when you outgrow raw manifests.
- Outcome: a reproducible, self-healing inference service you can extend with CI/CD, GPUs, or LLM serving.
- Current versions: Kubernetes v1.36, Docker Engine 26.1, KServe v0.18, Seldon Core 2.8, BentoML 1.4. Most deployment tutorials online still use deprecated probes and on_event("startup") hooks; this one does not.
Your model is finished. It hits 96% on the validation set, the notebook runs top to bottom, and on the demo call everyone nods. Then a backend engineer asks the only question that actually matters: "Great, how do we ship it?" Three weeks later it is still a model.pkl file on your laptop, and the dashboard the product team was promised shows nothing.
That gap, between a model that works in a notebook and one that serves live traffic, is where roughly 87% of machine learning models die. It is not a modeling gap; it is a packaging-and-operations gap, and it is exactly why the MLOps market sits at $4.39 billion in 2026 and is projected to reach $89.91 billion by 2034, a 45.8% CAGR (Fortune Business Insights). Docker and Kubernetes are how you close that gap. Docker makes your model run the same way everywhere; Kubernetes keeps it running when everything around it is changing. This guide takes one trained model from a notebook to an autoscaling, self-healing inference service on a cluster, and it is honest about when that is overkill.
From Notebook to Cluster: The Two Problems
The single biggest reason model deployment feels chaotic is that people treat it as one task. It is two, and they are solved by two different tools.
Packaging. "It works on my machine" is not a deployment strategy. Your model needs its exact Python version, its exact library versions (a model pickled with scikit-learn 1.4 is not guaranteed to load, or to predict correctly, under 1.6; scikit-learn warns about the version mismatch and makes no compatibility promise), the model artifact itself, and a server to answer requests. All of that has to be frozen into one immutable unit that runs identically on your laptop, in CI, and in production. That is Docker's job.
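One way to make the "exact library versions" requirement checkable is to record the versions at training time and compare them at serving startup. The sketch below is illustrative only: the helper names snapshot_versions and find_mismatches are hypothetical, and it uses nothing beyond the standard library's importlib.metadata.

```python
from importlib import metadata


def snapshot_versions(packages):
    """Return {package: installed version}, or None when a package is absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def find_mismatches(recorded, installed):
    """List (package, recorded, installed) for every version that differs."""
    return [
        (name, wanted, installed.get(name))
        for name, wanted in sorted(recorded.items())
        if installed.get(name) != wanted
    ]


# Training side: store this dict next to the artifact (for example as JSON).
recorded = snapshot_versions(["scikit-learn", "joblib"])

# Serving side: compare against the running environment before loading.
problems = find_mismatches(recorded, snapshot_versions(recorded))
for name, wanted, found in problems:
    print(f"{name}: trained with {wanted}, serving with {found}")
```

Run in the same environment, the comparison reports nothing. Inside a container built from different pins, it names the packages that drifted before the model is ever loaded.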
Orchestration. One container on one machine is a single point of failure. The moment real traffic arrives you need several copies, a load balancer in front of them, health checks that pull dead copies out of rotation, rolling updates that do not drop requests, memory limits so one bad batch cannot take down the node, and autoscaling so you are not paying for idle capacity at 3 a.m. That is Kubernetes' job.
Get this mental model right and everything downstream falls into place: Docker is the unit of shipping; Kubernetes is the system that runs the units. Conflating the two is how teams end up with a lonely docker run on a single VM, no health checks, no replicas, and a pager that goes off every time the box reboots.
Architecture of What We're Building
The flow is linear and the same regardless of framework. You train a model and save the artifact. You wrap it in a small web service that loads the artifact once and answers HTTP requests. You bake the service and the artifact into a Docker image. You push that image to a registry. Kubernetes pulls the image, runs several replicas behind a Service, watches their health, and scales them with load.
Here is the stack, mapped to the job each piece does. Pick tools by the problem they solve, not by their logo.
Everything in the core path uses long-stable Kubernetes APIs (apps/v1, autoscaling/v2) that will not churn under you. Kubernetes ships roughly every 15 weeks; the current line is v1.36 ("Haru"), and the manifests below run unchanged on any conformant cluster, from a local kind cluster to a managed EKS, GKE, or AKS.
The Tool Stack (and Why Each Choice)
FastAPI is the default serving choice in 2026 for a reason. It validates request bodies for you through Pydantic, generates an OpenAPI schema so downstream teams get a contract for free, and runs under an ASGI server (uvicorn) that handles concurrency without you writing thread pools. Flask still works, but you would be reimplementing validation and schema generation by hand.
Docker is non-negotiable for reproducibility. The subtlety, covered in Stage 2, is that a naive image is enormous and insecure; a multi-stage build fixes both.
Kubernetes is the orchestration layer, but it is not the only option, and a large fraction of models never need it. We will deploy with plain Deployment, Service, and HPA objects first, because they teach you the fundamentals and are genuinely the whole production story for many predictive models. Stage 6 covers when a purpose-built serving framework earns its keep.
Prerequisites & Project Setup
You need Python 3.11+, Docker Engine 26.x, kubectl, and a local Kubernetes cluster. If you do not have one, kind create cluster or minikube start gives you a real cluster on your laptop in under a minute.
The project layout:
iris-deploy/
├── app/
│   ├── __init__.py      # makes app/ an importable package
│   └── main.py          # FastAPI inference service
├── train.py             # produces model.joblib
├── model.joblib         # the trained artifact
├── requirements.txt
├── Dockerfile
└── k8s/
    ├── deployment.yaml
    ├── service.yaml
    └── hpa.yaml

Pin your dependencies. Unpinned versions are the most common cause of "it loaded yesterday, it crashes today":
# requirements.txt
fastapi==0.136.*
uvicorn[standard]==0.34.*
scikit-learn==1.6.*
joblib==1.4.*
pydantic==2.*

A throwaway training script so the rest of the guide has an artifact to deploy:
# train.py
import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
joblib.dump(model, "model.joblib")
print("wrote model.joblib")

Create a virtual environment and install those dependencies locally before you run anything, otherwise train.py and the local server will fail with ModuleNotFoundError:
python -m venv .venv
source .venv/bin/activate # macOS / Linux
# .venv\Scripts\activate # Windows (PowerShell / CMD)
pip install -r requirements.txt
python train.py

Stage 1 - Wrap the Model in an Inference API
A model file is not a service. Something has to load it into memory once, at startup, and answer HTTP requests against that in-memory object. Loading the model inside the request handler instead is the classic rookie mistake: every prediction re-reads the artifact from disk and your latency is dominated by I/O.
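The difference between the two patterns is easy to see with a counter. This is a toy sketch rather than the service itself: ArtifactStore stands in for the disk, the lambda stands in for a real model, and every class name here is hypothetical.

```python
class ArtifactStore:
    """Stand-in for the disk: counts how often the artifact is read."""

    def __init__(self):
        self.reads = 0

    def load(self):
        self.reads += 1
        # Placeholder "model": any callable that maps a feature row to a label.
        return lambda row: int(sum(row) > 10)


class ReloadingService:
    """Anti-pattern: reads the artifact inside every request."""

    def __init__(self, store):
        self.store = store

    def predict(self, row):
        return self.store.load()(row)


class LoadOnceService:
    """Reads the artifact once at startup and reuses the in-memory object."""

    def __init__(self, store):
        self.model = store.load()

    def predict(self, row):
        return self.model(row)


def reads_after(service_cls, n_requests):
    """Serve n_requests predictions and report how many artifact reads occurred."""
    store = ArtifactStore()
    service = service_cls(store)
    for _ in range(n_requests):
        service.predict([5.1, 3.5, 1.4, 0.2])
    return store.reads


print(reads_after(ReloadingService, 100), reads_after(LoadOnceService, 100))
```

With a real artifact each read costs disk I/O and deserialization, so the first pattern pays that cost on every request and the second pays it once per process.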
The current way to load once is FastAPI's lifespan context manager. A lot of tutorials still use @app.on_event("startup"), which has been deprecated; lifespan is the supported replacement.
# app/main.py
from contextlib import asynccontextmanager

import joblib
from fastapi import FastAPI, Response, status
from pydantic import BaseModel

ml = {}


@asynccontextmanager
async def lifespan(app: FastAPI):
    ml["model"] = joblib.load("model.joblib")  # loaded once, before traffic
    yield
    ml.clear()


app = FastAPI(title="iris-classifier", lifespan=lifespan)


class Features(BaseModel):
    sepal_length: float
    sepal_width: float
    petal_length: float
    petal_width: float


@app.get("/healthz")
def healthz():
    # liveness: is the process up? cheap, no model call
    return {"status": "ok"}


@app.get("/ready")
def ready(response: Response):
    # readiness: is the model loaded and able to serve?
    if "model" not in ml:
        response.status_code = status.HTTP_503_SERVICE_UNAVAILABLE
        return {"status": "loading"}
    return {"status": "ready"}


@app.post("/predict")
def predict(f: Features):
    X = [[f.sepal_length, f.sepal_width, f.petal_length, f.petal_width]]
    pred = ml["model"].predict(X)[0]
    return {"prediction": int(pred)}

Note the two distinct health endpoints. /healthz answers "is the process alive?" and is dirt cheap. /ready answers "is the model loaded and able to serve?" and returns 503 until the artifact is in memory. Kubernetes uses these for two different jobs, and conflating them is the cause of the most common deployment failure, covered in Stage 5. Test locally:
uvicorn app.main:app --reload
curl -s localhost:8000/ready
curl -s -X POST localhost:8000/predict \
  -H "content-type: application/json" \
  -d '{"sepal_length":5.1,"sepal_width":3.5,"petal_length":1.4,"petal_width":0.2}'

Stage 2 - Containerize with a Multi-Stage Docker Build
This is where most ML images go wrong. A naive FROM python:3.12 plus pip install produces a 1.5 GB image stuffed with compilers and build tools you will never run in production: slow to pull, slow to scale, and a far larger attack surface. The fix is a multi-stage build: install dependencies in a fat builder stage, then copy only the runtime artifacts into a slim final image.
# ---- builder ----
FROM python:3.12-slim AS builder
WORKDIR /app
RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# ---- runtime ----
FROM python:3.12-slim
RUN useradd --create-home --uid 1000 appuser
WORKDIR /app
COPY --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
COPY app/ ./app/
COPY model.joblib .
# never run as root (Dockerfile comments must sit on their own line)
USER appuser
EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

This pattern is not cosmetic. A naive single-stage ML image routinely runs over a gigabyte; a multi-stage build commonly brings the runtime image down to a few hundred megabytes. Smaller images pull faster, scale faster, and shrink the attack surface, because the compilers and build tools that carry most of the CVEs never reach the final image. Three rules pay for themselves here:
- Copy requirements.txt before your source code. Docker's BuildKit caches layers; if dependencies are installed before the app code is copied, changing a line of Python does not trigger a full reinstall.
- Run as a non-root user. A container that gets compromised should not also be root.
- Pin the base image (python:3.12-slim, not python:latest) so the build is reproducible.
Build and smoke-test before it ever touches a cluster:
docker build -t iris-classifier:0.1.0 .
docker run -p 8000:8000 iris-classifier:0.1.0
# docker run stays in the foreground, so run the check from a second terminal:
curl -s localhost:8000/ready

Stage 3 - Push the Image to a Registry
Kubernetes pulls images; it does not build them. Tag the image for your registry and push it. Tag with a real version, never latest. Using latest makes rollbacks ambiguous (which build was running?) and breaks the reproducibility that was the whole point of containerizing.
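The "never latest" rule is simple enough to enforce in a pre-deploy check. The following is a rough sketch with hypothetical helper names (image_tag, is_pinned); it handles the common name:tag shapes, including a registry host with a port, and deliberately treats digest-only references as untagged.

```python
def image_tag(reference):
    """Return the tag of an image reference, or None if it has no explicit tag."""
    # Drop a digest suffix (name@sha256:...), then inspect only the last path
    # component so a registry port (host:5000/...) is not mistaken for a tag.
    name = reference.split("@", 1)[0]
    last = name.rsplit("/", 1)[-1]
    if ":" not in last:
        return None
    return last.rsplit(":", 1)[1]


def is_pinned(reference):
    """True when the reference carries an explicit tag other than 'latest'."""
    tag = image_tag(reference)
    return tag is not None and tag != "latest"


for ref in (
    "registry.example.com/ml/iris-classifier:0.1.0",
    "iris-classifier:latest",
    "localhost:5000/iris-classifier",
):
    print(ref, "->", "ok" if is_pinned(ref) else "reject")
```

A check like this could run in CI against the image field of the manifest, so an untagged or latest image is rejected before it reaches the cluster.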
For a real cluster, push to a registry:
docker tag iris-classifier:0.1.0 registry.example.com/ml/iris-classifier:0.1.0
docker push registry.example.com/ml/iris-classifier:0.1.0

On a local kind cluster you do not need a registry at all. Load the image straight into the cluster under the same name the manifest references, so Kubernetes finds it locally:
docker tag iris-classifier:0.1.0 registry.example.com/ml/iris-classifier:0.1.0
kind load docker-image registry.example.com/ml/iris-classifier:0.1.0

The Deployment in Stage 4 sets imagePullPolicy: IfNotPresent, so on kind it uses the loaded image instead of trying to pull from the placeholder registry. On minikube, point your Docker CLI at the cluster's daemon before docker build so the image is built directly inside the cluster (and tag it with the same name the manifest references):
eval $(minikube docker-env) # macOS / Linux
# & minikube docker-env | Invoke-Expression   # Windows (PowerShell)

Stage 4 - Deploy to Kubernetes (Deployment + Service)
Now the model becomes a managed workload. A Deployment declares how many replicas you want and which image they run; Kubernetes continuously makes reality match that declaration and keeps it there, restarting any replica that dies. A Service gives that fleet of replicas one stable address and load-balances across them.
# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: iris-classifier
spec:
  replicas: 3
  selector:
    matchLabels:
      app: iris-classifier
  template:
    metadata:
      labels:
        app: iris-classifier
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000            # matches the appuser in the Dockerfile
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: server
          image: registry.example.com/ml/iris-classifier:0.1.0
          imagePullPolicy: IfNotPresent
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          ports:
            - containerPort: 8000
          resources:
            requests:              # what the scheduler reserves
              cpu: "250m"
              memory: "512Mi"
            limits:                # the hard ceiling before the kernel kills it
              cpu: "1"
              memory: "1Gi"
          readinessProbe:          # routes traffic only once /ready returns 200
            httpGet:
              path: /ready
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:           # restarts the pod if /healthz stops responding
            httpGet:
              path: /healthz
              port: 8000
            initialDelaySeconds: 15
            periodSeconds: 20
---
# k8s/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: iris-classifier
spec:
  selector:
    app: iris-classifier
  ports:
    - port: 80
      targetPort: 8000

kubectl apply -f k8s/
kubectl rollout status deployment/iris-classifier
kubectl port-forward svc/iris-classifier 8080:80   # test it

Stage 5 - Health Checks, Resources, and Autoscaling
The fields people skip in Stage 4 are the ones that keep a model alive in production. They are not decoration; they are the deployment.
Resource requests and limits are mandatory for ML workloads. Models hold their weights in memory. Without a memory limit, one oversized batch can consume the whole node and take its neighbors down with it. Without requests, the scheduler has no idea how much room a pod needs and packs them badly. Omit them and you will meet OOMKilled, the status Kubernetes shows when the kernel killed your container for exceeding memory, almost always under load and almost always at the worst time.
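Picking the memory numbers is easier with a measurement than a guess. As a rough starting point, the standard library's tracemalloc can report the peak allocation of a single batch. The sketch below uses a placeholder workload (fake_batch_predict, hypothetical) in place of a real model call. tracemalloc only sees allocations made through Python's allocators, not the full process footprint, so treat the result as a lower bound and leave headroom.

```python
import tracemalloc


def peak_memory_mib(fn, *args, **kwargs):
    """Run fn once and return (result, peak traced allocation in MiB)."""
    tracemalloc.start()
    try:
        result = fn(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / (1024 * 1024)


def fake_batch_predict(n_rows):
    """Placeholder workload: builds an n_rows x 4 feature matrix and scores it."""
    batch = [[5.1, 3.5, 1.4, 0.2] for _ in range(n_rows)]
    return [int(sum(row) > 10) for row in batch]


for n_rows in (1_000, 100_000):
    _, peak = peak_memory_mib(fake_batch_predict, n_rows)
    print(f"{n_rows} rows: peak {peak:.1f} MiB")
```

Swapping fake_batch_predict for the real predict call on the largest batch you expect gives a first estimate for the memory request. The limit then sits some margin above it.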
Readiness and liveness probes do different jobs, and using one endpoint for both is the most common deployment bug. The liveness probe hits /healthz: if it fails, Kubernetes restarts the pod. The readiness probe hits /ready: until it passes, Kubernetes keeps the pod out of the Service's load-balancing rotation. A pod that is still loading a multi-hundred-MB model will answer a TCP check and even a cheap HTTP check while failing real predictions. Because our /ready endpoint returns 503 until the model is actually in memory, a rolling update never sends traffic to a pod that cannot serve it. Set liveness looser than readiness, or a slow model load will trip a restart loop before the model finishes loading.
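A back-of-the-envelope check helps when choosing liveness settings. The sketch below makes several simplifying assumptions: every probe fails until startup finishes, failureThreshold is left at the Kubernetes default of 3, and probe timeouts and kubelet jitter are ignored. The function names are hypothetical.

```python
def seconds_until_restart(initial_delay, period, failure_threshold=3):
    """Approximate time from container start to a liveness-triggered restart
    when every probe fails (first probe at initial_delay, then one per period)."""
    return initial_delay + (failure_threshold - 1) * period


def survives_startup(load_seconds, initial_delay, period, failure_threshold=3):
    """True if startup finishes before consecutive failures trigger a restart."""
    return load_seconds < seconds_until_restart(initial_delay, period, failure_threshold)


# Liveness settings from the Deployment in Stage 4: initialDelaySeconds=15, periodSeconds=20.
print(seconds_until_restart(15, 20))
print(survives_startup(40, 15, 20), survives_startup(90, 15, 20))
```

Under these assumptions the Stage 4 settings tolerate roughly 55 seconds of startup. A model that takes 90 seconds to load would need a larger initialDelaySeconds or a higher failureThreshold to avoid a restart loop.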
Once requests and limits exist, you can autoscale on them. The HPA reads CPU usage from metrics-server, which is not installed on a fresh cluster. Install it first, or the HPA will sit at <unknown>/70% and never scale:
# minikube
minikube addons enable metrics-server
# kind (or any vanilla cluster)
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
# on kind, trust the kubelet's self-signed cert:
kubectl -n kube-system patch deploy metrics-server --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'

The HorizontalPodAutoscaler then adds and removes replicas as load changes:
# k8s/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: iris-classifier
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: iris-classifier
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

kubectl apply -f k8s/hpa.yaml
kubectl get hpa iris-classifier --watch

That is a self-healing, horizontally scaling inference service. For a large share of predictive models, this is the entire production story.
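To build intuition for what the HPA will do at a given load, the core scaling rule from the Kubernetes documentation, desired = ceil(current replicas × current metric / target metric), can be sketched in a few lines. This simplified version assumes the default 10% tolerance and clamps to the minReplicas and maxReplicas from the manifest above. It leaves out stabilization windows, pod readiness handling, and scaling policies.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization=70,
                     min_replicas=3, max_replicas=20, tolerance=0.1):
    """Simplified HPA rule: scale by the ratio of observed to target utilization."""
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no change
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(3, 140))   # load doubles relative to target
print(desired_replicas(3, 35))    # load halves, floor at minReplicas
print(desired_replicas(10, 210))  # capped at maxReplicas
```

Utilization here is the percentage of the CPU request, which is why the HPA cannot work until the Deployment declares requests.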
Stage 6 - When to Graduate to a Serving Framework
Hand-written manifests are the right starting point and often the right ending point. But once you are serving many models, or you need scale-to-zero, canary rollouts, GPU sharing, or LLM-specific features, a purpose-built serving layer earns its keep. The three that matter in 2026:
- KServe. A CNCF incubating project since late 2025, now at v0.18. It is Knative-based, so it gives you scale-to-zero and canary rollouts out of the box, a standardized V2 inference protocol, and first-class generative AI support: its LLMInferenceService exposes an OpenAI-compatible API backed by a vLLM runtime. The cost is the dependency stack; Knative and its routing rules are real operational weight.
- Seldon Core v2. The heavyweight for regulated industries: the strongest built-in payload logging, drift detection, and explainability (via Alibi), running MLServer and NVIDIA Triton behind a V2 protocol. Note the licensing change. Since January 2024, Core 2.7+ ships under the Business Source License, which is free for non-production but requires a commercial license in production.
- BentoML. Python-first and, unlike the other two, not Kubernetes-native: it runs anywhere and pairs with a cluster only when you need one. The best fit for small, fast-moving teams who write Python and do not want to hand-edit InferenceService YAML. Current release: 1.4.39.
The most common wrong turn is a Python-first team adopting KServe because it is the CNCF "standard," then burning a week on Knative routing rules. Match the tool to your operating reality.
Putting It All Together (and Knowing When to Keep It Simple)
End to end, you now have a repeatable path: train and save the artifact, serve it with FastAPI, freeze it into a slim multi-stage image, push to a registry, and run it on Kubernetes with probes, resource limits, and an HPA. Re-deploying a new model version is a three-line loop: build a new tagged image, push it, kubectl set image (or kubectl apply an updated manifest), and Kubernetes rolls it out one pod at a time with zero downtime because the readiness probe gates traffic.
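The redeploy loop is mechanical enough to script. The sketch below only assembles the commands as argument lists; the function name release_commands is hypothetical, the repository, Deployment, and container names are taken from the manifests in this guide, and a rollout status check is appended as a fourth step.

```python
IMAGE_REPO = "registry.example.com/ml/iris-classifier"  # placeholder registry from this guide
DEPLOYMENT = "iris-classifier"
CONTAINER = "server"  # container name in k8s/deployment.yaml


def release_commands(version):
    """Return the build, push, and rollout commands for one immutable version tag."""
    if not version or version == "latest":
        raise ValueError("use an immutable version tag, not 'latest'")
    image = f"{IMAGE_REPO}:{version}"
    return [
        ["docker", "build", "-t", image, "."],
        ["docker", "push", image],
        ["kubectl", "set", "image", f"deployment/{DEPLOYMENT}", f"{CONTAINER}={image}"],
        ["kubectl", "rollout", "status", f"deployment/{DEPLOYMENT}"],
    ]


for cmd in release_commands("0.2.0"):
    print(" ".join(cmd))
```

To execute rather than print, each list can be passed to subprocess.run(cmd, check=True), which stops the loop at the first failing step.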
But every line of that has an operational cost, and pretending otherwise is how teams over-engineer. A cluster has to be patched, secured, and observed. Knative adds components. Autoscaling needs a metrics pipeline. Kubernetes is not free. If you serve a single low-traffic model, you do not need any of this: a single Docker container on a managed container service, or a serverless container platform, will cost less and break less. Reach for Kubernetes when you have real reasons: multiple models, spiky traffic, a hard zero-downtime requirement, or a platform team already running a cluster. Adopt it because the problem demands it, not because it is on the architecture diagram.
What You've Actually Built
Concretely, you now have:
- A FastAPI inference service that loads the model once via lifespan and exposes distinct liveness and readiness endpoints.
- A lean, non-root, multi-stage Docker image that is reproducible and small.
- A Kubernetes Deployment running three self-healing replicas behind a stable Service.
- Resource limits and probes that survive bad batches and rolling updates without dropping traffic.
- A HorizontalPodAutoscaler that scales the model from 3 to 20 replicas on demand.
- A clear-eyed map of when to graduate to KServe, Seldon, or BentoML, and when to stay simple.
The natural next layers are observability (Prometheus and Grafana on your service metrics), drift detection (Evidently) so you catch the model degrading while it still returns 200s, and GitOps (Argo CD) so deployments are reviewable and reversible with a git revert.
Common Gotchas
These are the ones that bite in week one:
- OOMKilled under load. Set memory requests and limits, and profile peak memory with a realistic batch before you pick the numbers.
- Rolling update drops requests. Your readiness probe is passing before the model is loaded. Point it at /ready, which returns 503 until the artifact is in memory, not at a probe that only checks the process.
- Image is 1.5 GB and scaling is slow. Switch to a multi-stage build and a slim base; ship only runtime artifacts, not compilers.
- Pod stuck in CrashLoopBackOff. Usually the cause is a model file missing from the image, a dependency version mismatch, or a liveness probe that is tighter than the model load time. Check kubectl logs and loosen initialDelaySeconds.
- Rollback is ambiguous. You tagged latest. Use immutable version tags and let the Deployment manage revisions (kubectl rollout undo).
- model.pkl won't load in the container. The pickling library version differs between training and serving. Pin scikit-learn (and friends) to the exact versions used to train.
FAQs
Q1: Do I actually need Kubernetes to deploy a machine learning model?
No. If you serve a single model at low or steady traffic, a Docker container on a managed service or a serverless container platform is cheaper to run and far simpler to operate. Containerize first, because that gives you reproducibility no matter where you run it. Reach for Kubernetes when you have several models, traffic that spikes, a hard zero-downtime requirement, or a platform team already operating a cluster. The orchestration should answer a real need, not decorate the diagram.
Q2: What's the real difference between Docker and Kubernetes here?
They solve two different problems. Docker handles packaging: it freezes your model, its exact dependencies, and the inference server into one immutable image that runs the same way everywhere. Kubernetes handles orchestration: it runs many copies of that image, restarts the ones that die, routes traffic only to healthy pods, rolls out new versions without downtime, and scales replicas with load. Docker is the unit of shipping; Kubernetes is the system that runs the units. You almost always use both together.
Q3: Should I write my own Deployment YAML or use KServe, Seldon, or BentoML?
Start with a hand-written Deployment, Service, and HPA. For one or a few predictive models it is the whole production story, and it teaches you the fundamentals you will need to debug anything fancier. Graduate to a framework when you hit a concrete need: scale-to-zero and canary rollouts (KServe), payload logging, drift detection, and explainability for regulated industries (Seldon Core v2), or a Python-first workflow that should not touch YAML (BentoML). Adopt the framework for the feature, not the logo.
Q4: How does deploying a deep learning model (PyTorch, TensorFlow) differ?
The shape is identical: API, image, Deployment, Service, probes, HPA. Three things change. The image is larger and often needs a CUDA base image for GPU inference, so multi-stage builds matter even more. You request GPUs in the pod spec (nvidia.com/gpu: 1) and the cluster needs the NVIDIA device plugin. And model load time is longer, so set initialDelaySeconds generously and lean on the readiness probe to keep traffic away until the weights are loaded. For large language models specifically, skip raw Deployments and use KServe's vLLM runtime.
Sources: Kubernetes Releases; Docker multi-stage builds; FastAPI lifespan events; KServe (CNCF); Seldon open-core licensing; BentoML 1.4 release; Fortune Business Insights MLOps Market Report 2026.