Latency Distributions and Micro-Benchmarking to Identify and Characterize Kernel Hotspots
Offered By: USENIX via YouTube
Course Description
Overview
Syllabus
Intro
Why Large Bare Metal Boxes? • Faster local communication: UNIX domain sockets, shared memory
The Scale in our Department • 100K processes across hundreds of physical machines
SysV semaphore bottleneck (AIX)
Observations and Findings • AIX CPU measurement with hyper-threading enabled is very misleading • No out-of-the-box metrics on SysV IPC operations • Sporadic slowness (depending on concurrency/contention)
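The talk does not include its benchmark source, but the shape of such a tool is small. A minimal sketch (Linux-flavored; the iteration count and private key are illustrative assumptions, not from the talk) that times each SysV semaphore up/down pair and emits one latency sample per line for later bucketing:

```c
/* Minimal sketch: time SysV semop() call pairs to surface
 * per-operation latency. Iteration count is illustrative. */
#include <stdio.h>
#include <time.h>
#include <sys/ipc.h>
#include <sys/sem.h>

#define ITERS 100000

static double now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e9 + ts.tv_nsec;
}

int main(void) {
    int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    if (semid < 0) { perror("semget"); return 1; }

    struct sembuf up   = { 0, +1, 0 };
    struct sembuf down = { 0, -1, 0 };

    for (int i = 0; i < ITERS; i++) {
        double t0 = now_ns();
        semop(semid, &up, 1);    /* V */
        semop(semid, &down, 1);  /* P */
        double t1 = now_ns();
        printf("%.0f\n", t1 - t0);  /* one ns sample per line */
    }
    semctl(semid, 0, IPC_RMID);
    return 0;
}
```

Feeding the raw per-call samples into a histogram, rather than collapsing them to an average, is what exposes the sporadic, contention-dependent slowness the slide describes.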
SysV shared memory bottleneck (Linux) • Low-level application infrastructure code dropping messages • Messaging leverages a form of "zero-copy" IPC using SysV shared memory
SysV shared memory bottleneck (Linux, RHEL 6) • The micro-benchmark
Case #2: Observations and Findings • No out-of-the-box metrics on SysV IPC operations
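The talk's actual infrastructure code is not shown, but the "zero-copy" handoff it describes has a recognizable shape: the writer places a message directly in a SysV segment and the reader consumes it in place, so no payload bytes pass through the kernel. A rough sketch, with illustrative segment size, layout, and a deliberately naive busy-wait flag (real code needs proper synchronization and memory barriers):

```c
/* Sketch of a "zero copy" SysV shared-memory handoff between a
 * forked reader and a writer. Layout and flag are illustrative. */
#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

#define SEG_SIZE (1 << 20)   /* 1 MiB segment (illustrative) */

int main(void) {
    int shmid = shmget(IPC_PRIVATE, SEG_SIZE, IPC_CREAT | 0600);
    if (shmid < 0) { perror("shmget"); return 1; }

    if (fork() == 0) {                             /* reader */
        volatile char *p = shmat(shmid, NULL, 0);  /* same pages */
        while (p[0] == 0)                          /* naive spin on flag */
            ;
        printf("reader saw: %s\n", (const char *)p + 1);
        shmdt((void *)p);
        _exit(0);
    }

    char *p = shmat(shmid, NULL, 0);       /* writer */
    strcpy(p + 1, "message in place");     /* payload written once, read in place */
    p[0] = 1;                              /* publish */
    wait(NULL);
    shmdt(p);
    shmctl(shmid, IPC_RMID, NULL);
    return 0;
}
```

A micro-benchmark for this case would wrap the publish/consume cycle in the same timestamp-and-emit loop as the semaphore sketch above.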
UNIX domain socket bottleneck (Solaris) • Critical software infrastructure experiencing timeouts under load • Identity management with very strict SLOs • Narrowing down the problem • A key SLI for the service is token-generation latency
An Aside: Histograms and Distributions are Useful! • More representative of the data set
An Aside: A Histogram Example
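The aside's point, that a distribution is more representative of the data set than a single summary statistic, is easy to reproduce. A minimal sketch (the sample values and log2 bucket scheme are illustrative) showing how one outlier dominates the mean while the histogram keeps the typical case visible:

```c
/* Minimal sketch: bucket latency samples (microseconds) into a
 * log2 histogram instead of collapsing them to a mean. */
#include <stdio.h>

int main(void) {
    double samples[] = { 3, 4, 4, 5, 5, 6, 7, 9, 12, 4800 };  /* one outlier */
    int n = sizeof samples / sizeof samples[0];
    int buckets[16] = { 0 };
    double sum = 0;

    for (int i = 0; i < n; i++) {
        sum += samples[i];
        int b = 0;
        for (double v = samples[i]; v >= 2 && b < 15; v /= 2)
            b++;                     /* log2 bucket index */
        buckets[b]++;
    }

    printf("mean = %.1f us (dominated by the outlier)\n", sum / n);
    for (int b = 0; b < 16; b++)
        if (buckets[b])
            printf("[%6d us, %6d us): %d\n", 1 << b, 1 << (b + 1), buckets[b]);
    return 0;
}
```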
Early Observations • No out-of-the-box metrics on socket operations
Case #3: UNIX domain socket bottleneck (Solaris) • The micro-benchmark: testing against size
Case #3: Conclusions • Solaris 11.3 is limited to a max of 256K UDS sockets
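A sketch of the sort of micro-benchmark case #3 describes: round-trip latency over a UNIX domain socket pair, swept against payload size up through 256K. The sizes, iteration count, and socketpair() setup here are assumptions for illustration (and blocking writes are assumed to complete in one call, which a robust tool would not rely on):

```c
/* Sketch: UNIX domain socket ping-pong latency vs payload size. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define ITERS 10000

static double now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e9 + ts.tv_nsec;
}

int main(void) {
    size_t sizes[] = { 64, 1024, 16384, 262144 };  /* up through 256K */
    for (size_t s = 0; s < sizeof sizes / sizeof sizes[0]; s++) {
        size_t len = sizes[s];
        char *buf = calloc(1, len);
        int sv[2];
        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) { perror("socketpair"); return 1; }

        if (fork() == 0) {                       /* echo child */
            for (int i = 0; i < ITERS; i++) {
                size_t got = 0;
                while (got < len) {              /* drain full payload */
                    ssize_t r = read(sv[1], buf + got, len - got);
                    if (r <= 0) _exit(1);
                    got += (size_t)r;
                }
                write(sv[1], buf, len);          /* echo it back */
            }
            _exit(0);
        }

        double t0 = now_ns();
        for (int i = 0; i < ITERS; i++) {
            write(sv[0], buf, len);
            size_t got = 0;
            while (got < len) {
                ssize_t r = read(sv[0], buf + got, len - got);
                if (r <= 0) return 1;
                got += (size_t)r;
            }
        }
        double t1 = now_ns();
        printf("%7zu bytes: %.1f us round trip\n", len, (t1 - t0) / ITERS / 1e3);

        wait(NULL);
        close(sv[0]); close(sv[1]);
        free(buf);
    }
    return 0;
}
```

Plotting the per-size distributions, rather than one mean per size, is what lets a latency cliff at a particular payload size stand out.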
Task clone and exit bottleneck (Linux)
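For the clone/exit case the micro-benchmark can be equally small: time repeated fork() + _exit() + wait() cycles to expose task create/teardown cost in the kernel. A minimal sketch (iteration count illustrative), emitting one sample per line for bucketing:

```c
/* Sketch: time fork()/_exit()/waitpid() cycles (clone + exit cost). */
#include <stdio.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define ITERS 2000

static double now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e9 + ts.tv_nsec;
}

int main(void) {
    for (int i = 0; i < ITERS; i++) {
        double t0 = now_ns();
        pid_t pid = fork();
        if (pid == 0) _exit(0);         /* child exits immediately */
        if (pid < 0) { perror("fork"); return 1; }
        waitpid(pid, NULL, 0);
        double t1 = now_ns();
        printf("%.0f\n", t1 - t0);      /* one clone+exit sample per line */
    }
    return 0;
}
```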
More Summary (Plea to Kernel Folks) • The Prime Directive of Monitoring: Non-interference
References
Taught by
USENIX
Related Courses
How to Not Destroy Your Production Kubernetes Clusters (USENIX via YouTube)
SRE and ML - Why It Matters (USENIX via YouTube)
Knowledge and Power - A Sociotechnical Systems Discussion on the Future of SRE (USENIX via YouTube)
Tracing Bare Metal with OpenTelemetry (USENIX via YouTube)
Improving How We Observe Our Observability Data - Techniques for SREs (USENIX via YouTube)