
Latency Distributions and Micro-Benchmarking to Identify and Characterize Kernel Hotspots

Offered By: USENIX via YouTube

Tags

SREcon Courses, Data Analysis Courses, Operating Systems Courses

Course Description

Overview

Explore kernel hotspots and performance bottlenecks in large-scale systems through a conference talk that delves into latency distributions and micro-benchmarking techniques. Learn how to identify and characterize OS scale limits across various operating systems, including AIX, Linux, and Solaris. Discover case studies highlighting issues with SysV semaphores, shared memory, UNIX domain sockets, and task cloning. Gain insights into developing fixes and workarounds for these performance constraints, and understand the importance of complementing modern tracing facilities with focused micro-benchmarks. Examine the challenges of running bare-metal hardware with many cores and terabytes of main memory, and their impact on system scalability.
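As a concrete flavor of the micro-benchmarking approach described above, here is a minimal sketch (not code from the talk): it times pairs of SysV semaphore operations and buckets the observed latencies into a power-of-two histogram rather than reporting a single average. The iteration count, bucket layout, and the choice of a V/P pair are illustrative assumptions.

/* Minimal sketch: latency distribution of SysV semaphore operations.
 * Times a V (+1) followed by a P (-1) on a private semaphore and
 * buckets each pair's latency into power-of-two microsecond buckets. */
#include <stdio.h>
#include <time.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

#define ITERATIONS 100000
#define BUCKETS    8                             /* last bucket saturates */

static long elapsed_ns(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1000000000L + (b.tv_nsec - a.tv_nsec);
}

int main(void)
{
    int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    if (semid < 0) {
        perror("semget");
        return 1;
    }

    struct sembuf up   = { 0, +1, 0 };
    struct sembuf down = { 0, -1, 0 };
    long hist[BUCKETS] = { 0 };

    for (int i = 0; i < ITERATIONS; i++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        semop(semid, &up, 1);                    /* V */
        semop(semid, &down, 1);                  /* P */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        long us = elapsed_ns(t0, t1) / 1000;
        int b = 0;
        while (us > 1 && b < BUCKETS - 1) {      /* log2 bucket index */
            us >>= 1;
            b++;
        }
        hist[b]++;
    }

    for (int b = 0; b < BUCKETS; b++)
        printf("< %4d us: %ld\n", 2 << b, hist[b]);

    semctl(semid, 0, IPC_RMID);                  /* clean up the semaphore */
    return 0;
}

Even a crude histogram like this exposes tail latencies and sporadic slowness that an average would hide, which is the kind of behavior the case studies in the syllabus below dig into.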

Syllabus

Intro
Why Large Bare Metal Boxes? • Faster local communication: UNIX domain sockets, shared memory
The Scale in our Department • 100K processes across hundreds of physical machines
SysV semaphore bottleneck (AIX)
Observations and Findings • AIX CPU measurement with hyper-threading is very misleading • No 'out of the box' metrics on SysV IPC operations • Sporadic slowness (depending on concurrency/contention)
SysV shared memory bottleneck (Linux) • Low-level application infrastructure code dropping messages • Messaging leverages a form of "zero copy" IPC using SysV shared memory
SysV shared memory bottleneck (Linux RHEL 6) • The micro-benchmark
Case #2: Observations and Findings • No 'out of the box' metrics on SysV IPC operations
UNIX domain socket bottleneck (Solaris) • Critical software infrastructure experiencing timeouts under load • Identity management with very strict SLOs • Narrowing down the problem: a key SLI for the service is token generation latency
An Aside: Histograms and Distributions are Useful! • More representative of the data set
An Aside: A Histogram Example
Early Observations • No out of the box metrics on socket operations
Case #3: UNIX domain socket bottleneck (Solaris) • The micro-benchmark: testing against size (see the sketch after the syllabus)
Case #3: Conclusions • Solaris 11.3 is limited to a max of 256K UDS sockets
Task clone and exit bottleneck (Linux)
More Summary (Plea to Kernel Folks) • The Prime Directive of Monitoring: Non-interference
References
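For the Case #3 micro-benchmark referenced above ("testing against size"), one plausible shape is a round-trip latency sweep over a UNIX domain socket pair at several message sizes. The sketch below is hypothetical and not the talk's code; the message sizes and round-trip count are arbitrary choices, and error handling is kept minimal.

/* Hypothetical sketch: UNIX domain socket round-trip latency swept
 * over message size, using a forked echo child on a socketpair. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>

#define ROUND_TRIPS 10000

/* Read or write exactly len bytes; SOCK_STREAM may return short counts. */
static int xfer_full(int fd, char *buf, size_t len, int writing)
{
    size_t done = 0;
    while (done < len) {
        ssize_t n = writing ? write(fd, buf + done, len - done)
                            : read(fd, buf + done, len - done);
        if (n <= 0)
            return -1;
        done += (size_t)n;
    }
    return 0;
}

int main(void)
{
    size_t sizes[] = { 64, 512, 4096, 32768 };   /* illustrative sweep */

    for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
        size_t size = sizes[i];
        int sv[2];
        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
            perror("socketpair");
            return 1;
        }

        pid_t pid = fork();
        if (pid < 0) {
            perror("fork");
            return 1;
        }
        if (pid == 0) {                          /* child: echo server */
            char *buf = malloc(size);
            close(sv[0]);
            for (int r = 0; r < ROUND_TRIPS; r++) {
                if (xfer_full(sv[1], buf, size, 0) < 0) break;
                if (xfer_full(sv[1], buf, size, 1) < 0) break;
            }
            _exit(0);
        }
        close(sv[1]);

        char *buf = calloc(1, size);
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int r = 0; r < ROUND_TRIPS; r++) {
            xfer_full(sv[0], buf, size, 1);      /* ping */
            xfer_full(sv[0], buf, size, 0);      /* pong */
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
        printf("%6zu bytes: %8.1f us per round trip\n",
               size, ns / ROUND_TRIPS / 1000.0);

        close(sv[0]);
        free(buf);
        waitpid(pid, NULL, 0);
    }
    return 0;
}

Sweeping the message size helps separate per-message kernel overhead (dominant for small payloads) from copy bandwidth (dominant for large ones), which narrows down where a socket-layer bottleneck sits.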


Taught by

USENIX

Related Courses

How to Not Destroy Your Production Kubernetes Clusters
USENIX via YouTube
SRE and ML - Why It Matters
USENIX via YouTube
Knowledge and Power - A Sociotechnical Systems Discussion on the Future of SRE
USENIX via YouTube
Tracing Bare Metal with OpenTelemetry
USENIX via YouTube
Improving How We Observe Our Observability Data - Techniques for SREs
USENIX via YouTube