Real-Time Adaptive Controls for Resilient Distributed Systems
Offered By: USENIX via YouTube
Course Description
Overview
Explore a conference talk on implementing real-time adaptive controls that improve the resilience of distributed systems. Learn how CrowdStrike dynamically tunes service parameters using techniques inspired by TCP congestion control: services sample errors and latencies in real time and adjust their own settings, eliminating the need for periodic manual tuning. Discover the challenges and lessons learned from deploying this feature in CrowdStrike's production environment, which handles trillions of events daily. Gain insights into minimizing configuration surfaces, reducing operational toil, and preventing overload and cascading failures in modern services that expose hundreds of tunables.
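To make the idea concrete, here is a minimal Python sketch of additive-increase/multiplicative-decrease (AIMD), the classic scheme behind TCP congestion avoidance, applied to a service's concurrency limit. The description above does not say which algorithm or parameters the talk uses, so the class name `AimdLimiter`, its fields, and all default values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class AimdLimiter:
    """Hypothetical AIMD controller for a concurrency limit (illustration only)."""

    limit: float = 10.0            # current concurrency limit (assumed starting value)
    min_limit: float = 1.0         # never shrink below this floor
    max_limit: float = 100.0       # never grow beyond this ceiling
    increase: float = 1.0          # additive step after a healthy sample
    backoff: float = 0.5           # multiplicative factor after an unhealthy sample
    latency_target_s: float = 0.2  # assumed latency threshold, in seconds

    def observe(self, latency_s: float, ok: bool) -> float:
        """Feed one request sample and return the updated limit."""
        if ok and latency_s <= self.latency_target_s:
            # Healthy sample: probe for more capacity slowly.
            self.limit = min(self.max_limit, self.limit + self.increase)
        else:
            # Error or slow response: back off quickly to shed load.
            self.limit = max(self.min_limit, self.limit * self.backoff)
        return self.limit


if __name__ == "__main__":
    limiter = AimdLimiter()
    samples = [(0.05, True), (0.05, True), (0.50, True), (0.05, False)]
    for latency_s, ok in samples:
        print(f"latency={latency_s:.2f}s ok={ok} -> limit={limiter.observe(latency_s, ok)}")
```

The asymmetry is the point of the design: the limit grows slowly while samples look healthy and shrinks sharply on an error or a slow response, so the service sheds load without anyone retuning a static value.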
Syllabus
SREcon22 APAC - Real-Time Adaptive Controls for Resilient Distributed Systems
Taught by
USENIX
Related Courses
How to Not Destroy Your Production Kubernetes Clusters (USENIX via YouTube)
SRE and ML - Why It Matters (USENIX via YouTube)
Knowledge and Power - A Sociotechnical Systems Discussion on the Future of SRE (USENIX via YouTube)
Tracing Bare Metal with OpenTelemetry (USENIX via YouTube)
Improving How We Observe Our Observability Data - Techniques for SREs (USENIX via YouTube)