Running AI/ML Workloads on Kubernetes Using Kubeflow: A Beginner’s Guide
Artificial Intelligence (AI) and Machine Learning (ML) are transforming industries — from healthcare and finance to retail and transportation. But as models become more complex and datasets grow larger, data scientists and engineers need powerful, scalable, and automated infrastructure to manage their workloads.
That's where Kubernetes and Kubeflow come in.
In this beginner-friendly guide, we’ll explore how Kubernetes and Kubeflow work together to simplify and scale AI/ML workflows — with examples, architecture insights, and essential concepts you need to know.
Why Kubernetes for AI/ML Workloads?
Kubernetes is an open-source platform that automates container orchestration — deploying, scaling, and managing containerized applications.
Here’s why Kubernetes is ideal for ML workflows:
- Scalability: Distribute training workloads across multiple GPUs and nodes.
- Resource Optimization: Run jobs only when needed, scale up/down as required.
- Reproducibility: Ensure consistent environments using containers (e.g., Docker).
- Automation: Automate data preprocessing, training, tuning, and serving.
- Portability: Run your ML pipelines on any cloud or on-premise environment.
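To make the scheduling story concrete, here is a minimal sketch using the official kubernetes Python client that submits a one-off training Job requesting a single GPU. The image name and training script are placeholder assumptions, not a fixed recipe.

```python
# pip install kubernetes
from kubernetes import client, config

# Load credentials from ~/.kube/config (use load_incluster_config() inside a pod).
config.load_kube_config()

# Container that runs the training script; image and command are placeholders.
container = client.V1Container(
    name="trainer",
    image="my-registry/housing-train:latest",  # hypothetical image
    command=["python", "train.py"],
    resources=client.V1ResourceRequirements(
        limits={"nvidia.com/gpu": "1"}  # ask the scheduler for a GPU node
    ),
)

# A one-off batch Job: runs the container to completion, no restarts.
job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="housing-training"),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(containers=[container], restart_policy="Never")
        ),
        backoff_limit=2,  # retry a couple of times on failure
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```

Because this is a Job rather than a long-running Deployment, the cluster releases the GPU as soon as training finishes, which is exactly the resource-optimization behavior described above.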
What Is Kubeflow?
Kubeflow is an open-source ML toolkit built specifically for Kubernetes.
Think of it as the "MLOps platform for Kubernetes." It provides tools and frameworks to build, train, and deploy machine learning models, all while leveraging the power of Kubernetes underneath.
What Can You Do with Kubeflow?
- Build ML pipelines with reusable steps
- Run distributed training jobs (e.g., TensorFlow, PyTorch; see the sketch after this list)
- Automate hyperparameter tuning
- Serve models via inference endpoints
- Manage and monitor the entire ML lifecycle
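For instance, a distributed TensorFlow run is declared as a TFJob custom resource, which Kubeflow's training operator fans out across worker pods. The sketch below submits one through the generic CustomObjectsApi; the image, command, and replica count are illustrative assumptions, and the training operator must already be installed in the cluster.

```python
from kubernetes import client, config

config.load_kube_config()

# A TFJob custom resource: the training operator spreads the work
# across two worker pods. Image and command are placeholders.
tfjob = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "TFJob",
    "metadata": {"name": "dist-train"},
    "spec": {
        "tfReplicaSpecs": {
            "Worker": {
                "replicas": 2,
                "restartPolicy": "OnFailure",
                "template": {
                    "spec": {
                        "containers": [{
                            "name": "tensorflow",  # TFJob expects this container name
                            "image": "my-registry/tf-train:latest",  # hypothetical
                            "command": ["python", "train.py"],
                        }]
                    }
                },
            }
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org", version="v1", namespace="default",
    plural="tfjobs", body=tfjob,
)
```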
Key Components of Kubeflow
Let’s look at the most important building blocks of Kubeflow:
| Component | Purpose |
|---|---|
| Kubeflow Pipelines | Create, run, and manage ML workflows as DAGs (Directed Acyclic Graphs) |
| Katib | Hyperparameter tuning engine |
| KFServing / KServe | Model serving with autoscaling |
| Notebooks | Jupyter notebooks integrated with Kubernetes |
| TFJob / PyTorchJob | Distributed training support for popular frameworks |
| Central Dashboard | Unified UI to manage resources, pipelines, models, and experiments |
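To see what one of these components looks like in practice, here is a trimmed-down Katib Experiment that random-searches the learning rate. The metric name, image, and search bounds are assumptions for illustration; a real Experiment also configures how metrics are collected from trial logs.

```python
from kubernetes import client, config

config.load_kube_config()

# A minimal Katib Experiment: random search over the learning rate,
# maximizing a reported "accuracy" metric. All values are illustrative.
experiment = {
    "apiVersion": "kubeflow.org/v1beta1",
    "kind": "Experiment",
    "metadata": {"name": "lr-search"},
    "spec": {
        "objective": {
            "type": "maximize",
            "objectiveMetricName": "accuracy",  # hypothetical metric name
        },
        "algorithm": {"algorithmName": "random"},
        "maxTrialCount": 12,
        "parallelTrialCount": 3,
        "parameters": [{
            "name": "lr",
            "parameterType": "double",
            "feasibleSpace": {"min": "0.001", "max": "0.1"},
        }],
        "trialTemplate": {
            "primaryContainerName": "training",
            "trialParameters": [{
                "name": "learningRate",
                "reference": "lr",  # substituted into the trial spec below
            }],
            "trialSpec": {
                "apiVersion": "batch/v1",
                "kind": "Job",
                "spec": {
                    "template": {
                        "spec": {
                            "restartPolicy": "Never",
                            "containers": [{
                                "name": "training",
                                "image": "my-registry/train:latest",  # placeholder
                                "command": [
                                    "python", "train.py",
                                    "--lr=${trialParameters.learningRate}",
                                ],
                            }],
                        }
                    }
                },
            },
        },
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org", version="v1beta1", namespace="kubeflow",
    plural="experiments", body=experiment,
)
```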
A Simple Example: ML Pipeline with Kubeflow
Let’s say you're training a model to predict housing prices. Here’s how a basic Kubeflow pipeline might look:
1. Model Training: use TFJob or PyTorchJob to train your model on multiple GPUs.
2. Hyperparameter Tuning: use Katib to try different learning rates, batch sizes, or model architectures.
3. Model Serving: deploy the trained model with KServe, exposing REST or gRPC endpoints.
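Assuming the Kubeflow Pipelines v2 SDK (kfp) and placeholder step bodies, a sketch of this pipeline could look like the following; the bucket path and score are purely illustrative.

```python
# pip install kfp
from kfp import dsl, compiler

# Each step is a lightweight component; the bodies are placeholders
# standing in for real preprocessing and training logic.
@dsl.component
def preprocess() -> str:
    # ...clean the raw housing data, return a URI to the result...
    return "gs://my-bucket/housing/clean.csv"  # hypothetical location

@dsl.component
def train(data: str) -> float:
    # ...train the model on `data`, return a validation score...
    return 0.87

@dsl.pipeline(name="housing-price-pipeline")
def housing_pipeline():
    data = preprocess()
    train(data=data.output)  # wire preprocess output into training

# Compile to a YAML spec that can be uploaded via the Pipelines UI or API.
compiler.Compiler().compile(housing_pipeline, "housing_pipeline.yaml")
```

Once uploaded, each step runs as its own pod, and the DAG view in the Pipelines UI shows the steps and their dependencies.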
Architecture Overview: How Kubeflow Works on Kubernetes
Kubeflow streamlines the end-to-end machine learning (ML) lifecycle by running on top of Kubernetes. Its modular design allows teams to orchestrate everything from data preparation to model training, tuning, and serving, entirely within a cloud-native environment.
Let's break down how Kubeflow integrates with Kubernetes and the core components involved in this architecture.
User Interaction
The journey begins with the User, who interacts through the Kubeflow Central Dashboard—a unified UI that provides access to all Kubeflow services and tools.
Core Functionalities of Kubeflow
Once inside the dashboard, users can navigate through various integrated components:
- ML Pipelines: Define, deploy, and manage ML workflows using reusable pipeline steps.
- Katib (Hyperparameter Tuning): Automate the hyperparameter tuning process using algorithms like random search or Bayesian optimization.
- Notebooks (JupyterLab): Provision on-demand Jupyter notebooks for data exploration and model development.
- Model Serving: Deploy and monitor trained models using inference services such as KServe (formerly KFServing); a minimal example appears below.
Each of these services runs as a microservice inside Kubernetes, leveraging its scalability, orchestration, and resource management features.
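As promised above, here is the serving piece made concrete: a KServe InferenceService whose sklearn predictor pulls a saved model from object storage and exposes REST and gRPC endpoints. The storage URI is a placeholder, and this uses the older shorthand predictor spec that v1beta1 still accepts.

```python
from kubernetes import client, config

config.load_kube_config()

# A minimal KServe InferenceService: the sklearn predictor fetches the
# saved model from object storage and serves it with autoscaling.
isvc = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "housing-model"},
    "spec": {
        "predictor": {
            "sklearn": {
                "storageUri": "gs://my-bucket/housing/model"  # placeholder
            }
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="serving.kserve.io", version="v1beta1", namespace="default",
    plural="inferenceservices", body=isvc,
)
```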
Kubernetes as the Backbone
Under the hood, all Kubeflow components run on a Kubernetes Cluster, utilizing native Kubernetes resources like:
- Pods: Where individual ML components or jobs are executed.
- Jobs: Used for one-time tasks like training models.
- Services: Handle internal communication and expose applications to users.
- Volumes: Manage persistent data such as models, logs, and datasets (see the sketch below).
This infrastructure ensures resilience, scalability, and ease of deployment, making Kubeflow ideal for both research and production environments.
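As one small example of the Volumes point, training pods usually mount a PersistentVolumeClaim for datasets and artifacts. A minimal sketch, assuming the cluster has a default StorageClass:

```python
from kubernetes import client, config

config.load_kube_config()

# Request 10 GiB of persistent storage for datasets and model artifacts.
pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "ml-data"},
    "spec": {
        "accessModes": ["ReadWriteOnce"],
        "resources": {"requests": {"storage": "10Gi"}},
    },
}

client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="default", body=pvc
)
```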
Architecture Diagram
[Architecture diagram: the user reaches the Kubeflow Central Dashboard, which fronts Pipelines, Katib, Notebooks, and Model Serving, all running as workloads on a Kubernetes cluster. Diagram courtesy of the official Kubeflow documentation.]
Running Kubeflow Locally or in the Cloud
Option 1: Minikube (for Local Testing)
Kubeflow can be installed on your local machine using Minikube, but it requires a lot of resources (8+ GB RAM).
Option 2: Cloud Providers
The easiest way to run Kubeflow in production is via cloud-managed services:
- Google Cloud (GKE + Vertex AI)
- AWS (EKS + Kubeflow)
- Azure Kubernetes Service (AKS)
Cloud providers offer GPU support, object storage, and better scalability.
Getting Started with Kubeflow
Here’s a simplified path to run your first ML workload with Kubeflow:
1. Set up a Kubernetes cluster (locally with Minikube, or on a managed service such as GKE, EKS, or AKS).
2. Use kfctl to set up the platform.
3. Open the Central Dashboard and launch a notebook or run your first pipeline.
Tools You'll Use in a Typical Workflow
In a typical workflow you'll combine the components covered above: Notebooks for development, Pipelines for orchestration, TFJob/PyTorchJob for training, Katib for tuning, and KServe for serving.
Learn More & Practice: KodeKloud AI/ML & Kubernetes Courses
If you're serious about becoming a DevOps ML Engineer or want to explore MLOps, check out KodeKloud's in-depth courses on Kubernetes, containers, AI/ML, and cloud.
Here's a bonus course to help you master monitoring your workloads.
Final Thoughts
Kubernetes solved the infrastructure problem.
Kubeflow solves the ML lifecycle problem.
Together, they bring powerful automation and scalability to data science teams — allowing them to focus on models, not infrastructure.
Whether you're a beginner in DevOps or an ML engineer looking to scale experiments, learning Kubeflow is a smart move.