Demystifying Machine Learning in Production - Reasoning about a Large-Scale ML Platform

Offered By: USENIX via YouTube

Tags

SREcon Courses, Machine Learning Courses, Data Integrity Courses

Course Description

Overview

Explore a talk on improving the reliability of machine learning in production environments. Learn about common failure modes in large-scale ML systems and best practices for productionization. Gain insights into monitoring systems, protecting against human error, ensuring data integrity, and managing pipeline workloads efficiently. Understand where changes enter a production ML system (binaries, configuration, and data updates) and how pipeline backlogs arise. Apply an outside-in approach to ML reliability, drawing on experience with a large-scale ML production platform at Google.

Syllabus

Intro
4 things you can do for more reliable ML
ML on one machine
ML in production
What makes ML in prod interesting
What goes wrong?
4 things for more reliable ML
ML outages from the outside
Where changes happen: binaries
Where changes happen: configuration
Validating binary and config changes
Where changes happen: data
Validating data updates
Improving data integrity
Handling pipeline backlogs


Taught by

USENIX
