Practical Container Scheduling: Optimizations, Guarantees, and Trade-Offs at Netflix - Lecture

Offered By: Linux Foundation via YouTube

Tags

Microservices Courses, Distributed Systems Courses, Capacity Planning Courses, Cluster Management Courses, Autoscaling Courses, Optimization Algorithms Courses

Course Description

Overview

Explore the intricacies of container scheduling in large-scale distributed clusters through this conference talk by a Senior Software Engineer at Netflix. Dive deep into the challenges, design decisions, and trade-offs behind Fenzo, the open-source scheduling library that takes a holistic approach to providing a nimble scheduling core for various independently evolving clusters. Learn about capacity guarantees, task placement, elasticity, and operational insights for running clusters at scale. Discover how Netflix juggles multiple scheduling objectives and constraints, including bin packing, task locality, and capacity guarantees, to efficiently run microservices, batch, and stream processing applications in shared Mesos clusters. Gain insight into multi-goal optimization, cluster autoscaling, and extensibility strategies. Explore the fitness functions, hard and soft constraints, and queue setups used in Netflix's container scheduling process. Understand how to reason about allocation failures and how to size agent clusters for capacity. This talk provides practical knowledge for engineers working with container scheduling in complex, large-scale environments.
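
To make the bin-packing objective concrete, here is a minimal sketch of the idea in plain Java. The class, record, and method names are hypothetical illustrations, not Fenzo's actual API: the fitness of placing a task on an agent is simply how full that agent would be afterwards, so work concentrates on fewer agents and idle agents remain free to be scaled down.

// A minimal sketch (hypothetical names, not Fenzo's API) of a CPU/memory
// bin-packing fitness function: score each candidate agent by how full it
// would be after placing the task, so tasks pack tightly and empty agents
// can be terminated by cluster autoscaling.
public final class BinPackingFitnessSketch {

    /** Hypothetical view of an agent's total and currently used resources. */
    public record Agent(double cpus, double usedCpus, double memMB, double usedMemMB) {}

    /** Hypothetical resource request for a single task. */
    public record Task(double cpus, double memMB) {}

    /** Returns a fitness in [0, 1]; higher means tighter packing, 0 means the task does not fit. */
    public static double fitness(Task task, Agent agent) {
        double cpuAfter = agent.usedCpus() + task.cpus();
        double memAfter = agent.usedMemMB() + task.memMB();
        if (cpuAfter > agent.cpus() || memAfter > agent.memMB()) {
            return 0.0; // does not fit on this agent
        }
        double cpuUtil = cpuAfter / agent.cpus();
        double memUtil = memAfter / agent.memMB();
        return (cpuUtil + memUtil) / 2.0; // average utilization after placement
    }

    public static void main(String[] args) {
        Task task = new Task(2, 4096);
        Agent emptyAgent = new Agent(16, 0, 65536, 0);
        Agent busyAgent = new Agent(16, 12, 65536, 49152);
        // The busier agent scores higher, so the scheduler keeps packing it
        // and leaves the empty agent free for autoscaling to remove.
        System.out.printf("empty agent fitness: %.2f%n", fitness(task, emptyAgent));
        System.out.printf("busy agent fitness:  %.2f%n", fitness(task, busyAgent));
    }
}

Averaging CPU and memory utilization is only one way to combine resource dimensions; the talk also covers network bandwidth as a packing dimension and how such fitness scores are mixed with soft constraints.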

Syllabus

Intro
Reactive stream processing: Mantis
Container deployment: Titus
What the cluster needs to support: a heterogeneous mix of workloads
Why juggle at all?
Scheduling challenge in large clusters
Our initial goals for a cluster scheduler: multi-goal optimization for task placement, cluster autoscaling, extensibility
Multi goal task placement
Security
Capacity guarantees
Fenzo scheduling strategy
Fitness functions we use: CPU, memory, and network bin packing
Hard constraints we use: GPU server matching
Soft constraints we use: specified by individual jobs at submit time; balance tasks of a job across availability zones
Mixing fitness with soft constraints (see the sketch after this syllabus)
Our queues setup
Sizing agent clusters for capacity
Reasoning about allocation failures
What's next?
Questions?
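
The syllabus items on constraints can be illustrated with a second sketch, again using hypothetical names rather than Fenzo's real API: GPU server matching acts as a hard constraint that filters agents outright, availability-zone balancing acts as a soft constraint that returns a score between 0 and 1, and that score is blended with a bin-packing fitness value using made-up weights.

import java.util.List;
import java.util.Map;

// A hypothetical sketch (not Fenzo's real API) of hard constraints, soft
// constraints, and mixing a soft-constraint score with bin-packing fitness.
public final class ConstraintSketch {

    /** Hypothetical agent description. */
    public record Agent(String id, String zone, boolean hasGpu) {}

    /** Hypothetical task: whether it needs a GPU, and where its job's other tasks already run. */
    public record Task(boolean needsGpu, Map<String, Integer> jobTasksPerZone) {}

    /** Hard constraint: GPU tasks may only land on GPU agents, and non-GPU tasks may not. */
    static boolean gpuHardConstraint(Task task, Agent agent) {
        return task.needsGpu() == agent.hasGpu();
    }

    /** Soft constraint: prefer zones that currently run fewer tasks of this job. */
    static double zoneBalanceScore(Task task, Agent agent) {
        int inZone = task.jobTasksPerZone().getOrDefault(agent.zone(), 0);
        int max = task.jobTasksPerZone().values().stream().max(Integer::compare).orElse(0);
        if (max == 0) return 1.0;                 // nothing placed yet, any zone is fine
        return 1.0 - (double) inZone / (max + 1); // emptier zones score higher
    }

    /** Mixing fitness with soft constraints: a weighted blend (weights are made up here). */
    static double totalScore(double binPackingFitness, double softConstraintScore) {
        return 0.7 * binPackingFitness + 0.3 * softConstraintScore;
    }

    public static void main(String[] args) {
        Task gpuTask = new Task(true, Map.of("us-east-1a", 3, "us-east-1b", 1));
        List<Agent> agents = List.of(
                new Agent("a1", "us-east-1a", true),
                new Agent("a2", "us-east-1b", true),
                new Agent("a3", "us-east-1b", false));
        for (Agent agent : agents) {
            if (!gpuHardConstraint(gpuTask, agent)) {
                System.out.println(agent.id() + ": rejected by hard constraint");
                continue;
            }
            double score = totalScore(0.8 /* pretend bin-packing fitness */, zoneBalanceScore(gpuTask, agent));
            System.out.printf("%s: total score %.2f%n", agent.id(), score);
        }
    }
}

The 70/30 weighting is purely illustrative; how Netflix actually weighs fitness against soft constraints is covered in the talk itself.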


Taught by

Linux Foundation

Related Courses

Advanced Operating Systems
Georgia Institute of Technology via Udacity
High Performance Computing
Georgia Institute of Technology via Udacity
GT - Refresher - Advanced OS
Georgia Institute of Technology via Udacity
Distributed Machine Learning with Apache Spark
University of California, Berkeley via edX
CS125x: Advanced Distributed Machine Learning with Apache Spark
University of California, Berkeley via edX