Designing and Implementing Production Machine Learning Systems (MLOps)
As machine learning (ML) becomes ubiquitous in technology, there is an increasing need for well-engineered ML systems and processes that enable ML algorithms to drive business value. The focus of enterprise ML has shifted from the ML models themselves to the software engineering, infrastructure and best practices necessary to support ML at scale in production. Bringing a model from a data scientist’s notebook to running live in an application requires robust systems, MLOps and ML governance.
This course is an introduction to ML systems in production that gives students hands-on exposure to how real production ML systems operate. Using Python, Docker, Kubernetes, Google Cloud and various open-source tools, students will bring the different components of an ML system to life and set up real, automated infrastructure.
Unit 1 - Overview of Machine Learning Systems in Production
- Machine learning in industry versus academia
- Comparing ML engineering and software engineering
- Components of production ML systems
- Online versus offline ML systems
- Demonstration: a production ML system
- Hands-on: Introduction to Google Cloud, project setup, and gcloud commands
- Hands-on: Setting up our git repository
Unit 2 - Machine Learning Engineering Fundamentals
- Software engineering principles
- Systems design 101
- ML systems design 101
- MLOps concepts and design principles
- Hands-on: Essential Google Cloud services for ML
- Hands-on: Kubernetes and Google Kubernetes Engine (GKE) intro
- Your ML in production project: Ideating
Unit 3 - Feature Systems
- Introduction to feature systems
- Common feature systems design patterns
- Developer experience in feature systems and ML systems
- Hands-on: Working with different feature sources and data stores on Google Cloud
- Hands-on: Building a miniature feature system in the cloud
- Your ML in production project: Ideating
Unit 4 - ML Model Training Pipelines
- Components of ML training pipelines
- Workflow orchestration and automation
- Cost and value analysis
- Setting up an ML pipeline
- Hands-on: Introduction to Kubeflow and building an automated pipeline
- Hands-on: Running automated training jobs on Kubernetes
- Your ML in production project: Design and Planning
Unit 5 - Managing Training Experiments, ML Metadata, and Model Registries
- Experimentation as an ML practitioner
- Hands-on: Setting up a centralized metadata store and model registry
- Hands-on: Tracking and logging hyperparameters
- Hands-on: Using model registries
- Your ML in production project: Design and Planning
Unit 6 - Deploying Machine Learning Models
- Generating offline predictions
- Online model serving systems
- Common real-time deployment architectures
- Hands-on: Developing an automated offline prediction workflow using Kubeflow and Dataflow
- Hands-on: Deploying ML models on Kubernetes for real-time inference with Seldon
- Hands-on: Scaling ML model deployments
- Your ML in production project: Architecture Review
Unit 7 - ML Observability
- Infrastructure and software observability
- Latency, throughput, availability, and reliability
- ML observability, ML model/feature drift, and ML explainability
- Fairness and bias
- Hands-on: Setting up Prometheus and Grafana on Kubernetes
- Hands-on: Accessing logs and metrics in Google Cloud
- Hands-on: Logging predictions and implementing ML observability
- Your ML in production project: Architecture Review
Unit 8 - Experimentation and Reliability Engineering
- ML experimentation design and algorithms 101
- Hands-on: A/B testing with Seldon on Kubernetes
- Hands-on: Multi-armed bandits with Seldon on Kubernetes
- Hands-on: Canary/shadow deployments on Kubernetes
- Your ML in production project: Implementation
Unit 9 - Continuous Learning
- Streaming versus batch processing
- Event-driven, asynchronous systems
- Stateful ML systems and incremental model updates
- Hands-on: Designing and implementing a stateful ML system on Kubernetes
- Your ML in production project: Implementation
Unit 10 - Machine Learning Governance
- Observability, visibility and control
- Monitoring and alerting
- Model service catalogue
- Security
- Compliance and auditability
- Your ML in production project: Presentation
Prerequisites:
- You are expected to be familiar with an object-oriented programming language (preferably Python) and to have experience with basic machine learning concepts and models. Some previous exposure to a cloud environment (AWS, Google Cloud, Azure, etc.) or other software engineering experience would be helpful but is not necessary.
Certificate:
- Certificates are awarded at the end of the program upon satisfactory completion of the course. Students are evaluated on a pass/fail basis for their performance on the required homework and final project (where applicable). Students who complete 80% of the homework and attend a minimum of 85% of all classes are eligible for the certificate of completion.
This course is available for remote learning to anyone with access to an internet-connected device with a microphone (this includes most computers and tablets). Classes will take place with a live instructor at the dates and times listed below.
After you register, the instructor will send additional information about how to log on and participate in the class.
School Notes: We offer a certification licensed by the NYS Board of Education.