
Twine - A Unified Cluster Management System for Shared Infrastructure

Offered By: USENIX via YouTube

Tags

OSDI (Operating Systems Design and Implementation) Courses
Cluster Management Courses
Data Center Management Courses
Twine Courses

Course Description

Overview

Explore a comprehensive presentation on Twine, Facebook's innovative cluster management system designed for shared infrastructure. Delve into the system's unique approach to managing one million machines across multiple data centers in a geographic region through a single control plane. Learn about the TaskControl API that enables application-specific customization, and discover how host profiles are utilized to optimize hardware and OS settings for diverse workloads. Understand the rationale behind Facebook's decision to deploy power-efficient small machines universally and leverage autoscaling for improved utilization. Gain insights into the challenges and solutions involved in migrating workloads to shared infrastructure, and examine the lessons learned from implementing this large-scale system. Compare Twine's approach to conventional practices and explore its impact on performance, efficiency, and resource management in data centers.

Syllabus

Intro
Data center geographic regions
What design decisions did Twine make differently?
What if we used Kubernetes?
How does Twine avoid stranded capacity?
How does Twine perform fleet-wide optimization?
How does Twine perform fleet-wide optimization for an entire geographic region?
How well does the Twine scheduler scale?
How do we mitigate risks with 1M machines per deployment?
Private pools or shared infrastructure?
What is host customization?
What is the overhead for host profile switches?
What drives host profile changes?
What are the challenges with supporting ubiquitous shared infrastructure?
Challenge: Tasks are not homogeneous
How does Twine collaborate with applications?
What is our shared infrastructure adoption?
How easy is it to migrate onto shared infrastructure?
Power is our most constrained resource
Big machines or small machines?
Why use small machines?
How much do we save by using small machines?
What lessons did we learn using small machines?
Conclusion


Taught by

USENIX

Related Courses

GraphX - Graph Processing in a Distributed Dataflow Framework
USENIX via YouTube
Theseus - An Experiment in Operating System Structure and State Management
USENIX via YouTube
RedLeaf - Isolation and Communication in a Safe Operating System
USENIX via YouTube
Microsecond Consensus for Microsecond Applications
USENIX via YouTube
KungFu - Making Training in Distributed Machine Learning Adaptive
USENIX via YouTube