This course takes you from ad-hoc Jupyter notebooks to production-style MLOps workflows tailored for AIOps use cases such as anomaly detection and incident prediction. You will learn how to make models reproducible, trackable, and deployable at scale using MLflow for experiment tracking and model packaging, and Kubeflow Pipelines for end-to-end orchestration on Kubernetes. Through hands-on labs, you will convert exploratory code into parameterized scripts, track experiments with MLflow and MinIO, deploy inference services to Kubernetes, and automate train → register → validate → deploy pipelines for AIOps workloads.
Prerequisites
- Python Basics
- Kubernetes Basics
- Machine Learning Fundamentals
Course Highlights
- Why MLOps Is Critical for AIOps
This module explains the shift from exploratory, notebook-based experimentation to production-grade ML workflows for AIOps, showing how MLOps practices improve reliability, observability, and governance for models used in incident management and anomaly detection.
- Lab 1.1 – From Notebook to Production
- Convert an anomaly detection Jupyter notebook into a train.py script.
- Introduce CLI arguments, random seeds, and reproducibility best practices (see the sketch below).
- Run multiple configurations manually to prepare for later automation.
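As a rough illustration of where Lab 1.1 ends up, here is a minimal train.py sketch; the flag names, the Isolation Forest model, and the synthetic data are assumptions standing in for the course's actual notebook:

```python
# train.py -- illustrative sketch only; flag names and model choice are assumptions.
import argparse
import random

import numpy as np
from sklearn.ensemble import IsolationForest


def main():
    parser = argparse.ArgumentParser(description="Train an anomaly detection model.")
    parser.add_argument("--n-estimators", type=int, default=100)
    parser.add_argument("--contamination", type=float, default=0.05)
    parser.add_argument("--seed", type=int, default=42)
    args = parser.parse_args()

    # Fix the relevant random seeds so repeated runs are reproducible.
    random.seed(args.seed)
    np.random.seed(args.seed)

    # Stand-in for the course's metric dataset: synthetic CPU/latency-style features.
    rng = np.random.default_rng(args.seed)
    X = rng.normal(size=(1000, 4))

    model = IsolationForest(
        n_estimators=args.n_estimators,
        contamination=args.contamination,
        random_state=args.seed,
    ).fit(X)

    anomaly_rate = float((model.predict(X) == -1).mean())
    print(f"anomaly_rate={anomaly_rate:.4f}")


if __name__ == "__main__":
    main()
```

Running it twice with the same --seed should produce identical output, which is the reproducibility property the lab is after.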
- Experiment Tracking & Model Packaging with MLflow
This module introduces MLflow Tracking for capturing parameters, metrics, and artifacts across experiments, and shows how to configure MinIO as an S3-compatible artifact store integrated with an MLflow Tracking Server for reproducible AIOps workflows.
- Lab 2.1 – Setting Up MLflow & MinIO
- Deploy MinIO and MLflow Tracking Server.
- Verify UI access and connectivity between MLflow and MinIO as an artifact backend.
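A quick connectivity check in the spirit of Lab 2.1 might look like this, assuming MLflow and MinIO are reachable at the placeholder endpoints shown; MLFLOW_S3_ENDPOINT_URL is how MLflow clients are pointed at an S3-compatible store such as MinIO:

```python
import os

import mlflow

# Placeholder endpoints and credentials -- substitute whatever your deployment exposes.
os.environ["MLFLOW_S3_ENDPOINT_URL"] = "http://minio.example.local:9000"
os.environ["AWS_ACCESS_KEY_ID"] = "minioadmin"
os.environ["AWS_SECRET_ACCESS_KEY"] = "minioadmin"

mlflow.set_tracking_uri("http://mlflow.example.local:5000")

# Logging a tiny artifact exercises both the tracking server and the MinIO backend.
with mlflow.start_run(run_name="connectivity-check"):
    mlflow.log_param("smoke_test", True)
    mlflow.log_text("hello from the connectivity check", "check.txt")
```

If the run and its check.txt artifact show up in the MLflow UI, the artifact store is wired up correctly.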
- Lab 2.2 – Logging Parameters, Metrics, and Artifacts
- Instrument train.py with MLflow Tracking API calls (sketched below).
- Log parameters, metrics, and model artifacts, then explore runs in the MLflow UI.
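Lab 2.2's instrumentation might look roughly like the following, reusing the illustrative Isolation Forest setup from earlier; the parameter and metric names are assumptions, not the course's exact schema:

```python
import mlflow
import mlflow.sklearn
import numpy as np
from sklearn.ensemble import IsolationForest


def train_and_log(n_estimators: int = 100, contamination: float = 0.05, seed: int = 42):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(1000, 4))  # stand-in for real AIOps metrics

    with mlflow.start_run():
        # Parameters: everything needed to reproduce the run.
        mlflow.log_param("n_estimators", n_estimators)
        mlflow.log_param("contamination", contamination)
        mlflow.log_param("seed", seed)

        model = IsolationForest(
            n_estimators=n_estimators,
            contamination=contamination,
            random_state=seed,
        ).fit(X)

        # Metrics: quantities you want to compare across runs in the UI.
        anomaly_rate = float((model.predict(X) == -1).mean())
        mlflow.log_metric("anomaly_rate", anomaly_rate)

        # Artifacts: the fitted model itself, so later labs can serve it.
        mlflow.sklearn.log_model(model, artifact_path="model")


if __name__ == "__main__":
    train_and_log()
```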
- Lab 2.3 – Packaging Models for Reproducibility
- Create MLproject and conda.yaml files for reproducible runs.
- Define entry points and re-run experiments using the MLflow CLI (see the example below).
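With an MLproject file declaring a main entry point for train.py, a run can be reproduced from the CLI (mlflow run .) or programmatically; a hedged sketch of the programmatic form, with an illustrative parameter name:

```python
import mlflow.projects

# Programmatic equivalent of `mlflow run . -P n_estimators=200`;
# assumes the MLproject file declares a "main" entry point that accepts
# an n_estimators parameter (names are illustrative, not the course's).
submitted = mlflow.projects.run(
    uri=".",
    entry_point="main",
    parameters={"n_estimators": 200},
)
print(submitted.run_id, submitted.get_status())
```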
- Deploying & Serving AIOps Models
This module focuses on moving from trained models to live inference endpoints suitable for real-time anomaly detection, covering different ways to serve MLflow models and expose them via REST APIs for consumption by AIOps systems.
- Lab 3.1 – Serving Models with MLflow
- Serve the trained model locally from MLflow runs.
- Register the model in the MLflow Model Registry and serve it from there.
- Test predictions using curl and Python requests (see the sketch below).
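A Python requests test in the spirit of Lab 3.1, assuming the model is served locally on a placeholder port; the exact JSON payload key (inputs vs. dataframe_split) depends on the MLflow version and the model's input signature:

```python
import requests

# Assumes something like `mlflow models serve -m models:/aiops-anomaly-detector/1 -p 5001`
# is already running; the model name and port are placeholders.
url = "http://127.0.0.1:5001/invocations"

# Four numeric features per row, matching the illustrative training sketch.
payload = {"inputs": [[0.1, 0.2, 0.3, 0.4], [5.0, 5.0, 5.0, 5.0]]}

response = requests.post(url, json=payload, timeout=10)
response.raise_for_status()
print(response.json())  # e.g. predictions where -1 marks an anomaly
```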
- Lab 3.2 – Containerizing and Deploying to Kubernetes
- Package the serving application into a Docker image.
- Deploy the model-serving service to Kubernetes using Deployment and Service resources (sketched below).
- Verify access to predictions through a REST API endpoint.
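The lab most likely applies plain YAML manifests, but the same Deployment and Service objects can be sketched with the official kubernetes Python client; the image name, labels, namespace, and ports below are placeholders:

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster

labels = {"app": "anomaly-detector"}                       # placeholder label
image = "registry.example.local/anomaly-detector:latest"  # placeholder image

container = client.V1Container(
    name="model-server",
    image=image,
    ports=[client.V1ContainerPort(container_port=5001)],
)

deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="anomaly-detector"),
    spec=client.V1DeploymentSpec(
        replicas=1,
        selector=client.V1LabelSelector(match_labels=labels),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels=labels),
            spec=client.V1PodSpec(containers=[container]),
        ),
    ),
)

service = client.V1Service(
    metadata=client.V1ObjectMeta(name="anomaly-detector"),
    spec=client.V1ServiceSpec(
        selector=labels,
        ports=[client.V1ServicePort(port=80, target_port=5001)],
    ),
)

client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
client.CoreV1Api().create_namespaced_service(namespace="default", body=service)
```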
- Orchestrating AIOps Pipelines with Kubeflow
This module teaches how to automate the full ML lifecycle, from training through validation to deployment, using Kubeflow Pipelines, and how to connect Kubeflow components with MLflow to build traceable, production-style AIOps pipelines.
- Lab 4.1 – Exploring Kubeflow Pipelines
- Access the Kubeflow UI and inspect pre-built sample pipelines.
- Run sample pipelines and observe execution graphs and artifacts.
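The same exploration can also be scripted with the KFP SDK; a tentative sketch, assuming the Pipelines API is reachable at the placeholder host below and noting that response attribute names differ between SDK major versions:

```python
import kfp

# Placeholder endpoint -- use the host your Kubeflow installation exposes.
client = kfp.Client(host="http://kubeflow.example.local/pipeline")

# List the pre-built sample pipelines that ship with the installation.
pipelines = client.list_pipelines()
for p in pipelines.pipelines or []:
    # Attribute name differs between KFP SDK v1 (name) and v2 (display_name).
    print(getattr(p, "display_name", None) or getattr(p, "name", None))
```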
- Lab 4.2 – Building the Training & Registration Components
- Create a train component that uses the existing train.py script.
- Create a register component that pushes the trained model to the MLflow Model Registry.
- Run a two-step Kubeflow pipeline to train and log a model to MLflow (see the sketch below).
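A rough shape for the two-step pipeline using the KFP v2 SDK; the component bodies, base image, MLflow URI, and registered model name are simplified placeholders rather than the course's actual components:

```python
from kfp import dsl


@dsl.component(base_image="python:3.11", packages_to_install=["mlflow", "scikit-learn"])
def train(n_estimators: int) -> str:
    """Train a model, log it to MLflow, and return the run ID (placeholder body)."""
    import mlflow
    import mlflow.sklearn
    import numpy as np
    from sklearn.ensemble import IsolationForest

    mlflow.set_tracking_uri("http://mlflow.example.local:5000")  # placeholder URI
    with mlflow.start_run() as run:
        X = np.random.default_rng(42).normal(size=(1000, 4))
        model = IsolationForest(n_estimators=n_estimators, random_state=42).fit(X)
        mlflow.sklearn.log_model(model, artifact_path="model")
    return run.info.run_id


@dsl.component(base_image="python:3.11", packages_to_install=["mlflow"])
def register(run_id: str) -> str:
    """Register the logged model in the MLflow Model Registry (placeholder name)."""
    import mlflow

    mlflow.set_tracking_uri("http://mlflow.example.local:5000")
    result = mlflow.register_model(f"runs:/{run_id}/model", "aiops-anomaly-detector")
    return result.version


@dsl.pipeline(name="train-and-register")
def pipeline(n_estimators: int = 100):
    train_task = train(n_estimators=n_estimators)
    register(run_id=train_task.output)
```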
- Lab 4.3 – Building the Full Train → Validate → Deploy Pipeline
- Add a validate component that checks whether the model’s anomaly rate is within acceptable limits.
- Add a deploy component that consumes the trained model and deploys it to production.
- Compile, upload, and trigger a four-step Kubeflow pipeline to automate model training and serving end to end (compile-and-submit sketch below).
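Compiling and triggering the finished pipeline from the SDK could look roughly like this; the pipeline module, host, and argument names are hypothetical and carried over from the earlier sketches:

```python
import kfp
from kfp import compiler

# `pipeline` is the four-step train / register / validate / deploy function,
# i.e. the two-step sketch above extended with the validate and deploy components.
from aiops_pipeline import pipeline  # hypothetical module name

compiler.Compiler().compile(pipeline_func=pipeline, package_path="aiops_pipeline.yaml")

client = kfp.Client(host="http://kubeflow.example.local/pipeline")  # placeholder host
client.create_run_from_pipeline_package(
    "aiops_pipeline.yaml",
    arguments={"n_estimators": 200},
    run_name="train-validate-deploy",
)
```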