Latency Distributions and Micro-Benchmarking to Identify and Characterize Kernel Hotspots
Offered By: USENIX via YouTube
Course Description
Overview
Syllabus
Intro
Why Large Bare Metal Boxes? • Faster local communication: UNIX domain sockets, shared memory
The Scale in our Department • 100K processes across hundreds of physical machines
SysV semaphore bottleneck (AIX)
Observations and Findings • AIX CPU measurement with hyper-threading is very misleading • No "out of the box" metrics on SysV IPC operations • Sporadic slowness (depending on concurrency/contention)
SysV shared memory bottleneck (Linux) • Low-level application infrastructure code dropping messages • Messaging leverages a form of "zero copy" IPC using SysV
SysV shared memory bottleneck (Linux RHEL 6) • The micro-benchmark
Case #2: Observations and Findings • No "out of the box" metrics on SysV IPC operations
UNIX domain socket bottleneck (Solaris) • Critical software infrastructure experiencing timeouts on load • Identity management with very strict SLOs • Narrowing down the problem • A key SLI for the service is token generation latency
An Aside: Histograms and Distributions are Useful! • More representative of the data set
An Aside: A Histogram Example
Early Observations • No "out of the box" metrics on socket operations
Case #3: UNIX domain socket bottleneck (Solaris) • The micro-benchmark • t-testing against size
Case #3: Conclusions • Solaris 11.3 is limited to a max of 256K UDS sockets
Task clone and exit bottleneck (Linux)
More Summary (Plea to Kernel Folks) • The Prime Directive of Monitoring: Non-interference
References
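The syllabus items on micro-benchmarking and on histograms can be illustrated with a minimal sketch. This is not the benchmark used in the talk. It assumes a single-threaded ping-pong over a local socket pair, and the helper names (`measure_roundtrips`, `percentile`, `log2_histogram`) are hypothetical. It records per-operation latency and summarizes it as percentiles and a power-of-two histogram rather than a single average.

```python
import math
import socket
import time
from collections import Counter


def _recv_exact(sock, size):
    """Read exactly `size` bytes from a stream socket."""
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("socket closed before full message arrived")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def measure_roundtrips(n=10_000, payload=b"x" * 64):
    """Time n ping-pong exchanges over a connected socket pair, in nanoseconds."""
    # socketpair() defaults to AF_UNIX on platforms that support it.
    a, b = socket.socketpair()
    samples = []
    try:
        for _ in range(n):
            start = time.perf_counter_ns()
            a.sendall(payload)
            _recv_exact(b, len(payload))
            b.sendall(payload)
            _recv_exact(a, len(payload))
            samples.append(time.perf_counter_ns() - start)
    finally:
        a.close()
        b.close()
    return samples


def percentile(samples, p):
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def log2_histogram(samples):
    """Count samples per power-of-two bucket, keyed by the bucket's lower bound."""
    counts = Counter()
    for value in samples:
        lower = 0 if value <= 0 else 1 << (value.bit_length() - 1)
        counts[lower] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    latencies = measure_roundtrips()
    for p in (50, 99, 99.9):
        print(f"p{p}: {percentile(latencies, p)} ns")
    for lower, count in log2_histogram(latencies).items():
        print(f">= {lower:>10} ns: {count}")
```

On a toy data set such as `[1, 2, 3, 4, 100]` the mean is 22, while the median is 3 and the histogram shows one outlier in the 64 to 127 bucket. That gap is why a distribution is more representative of the data set than an average.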
Taught by
USENIX
Related Courses
Introduction to Enterprise Computing
Marist College via Independent
Advanced Operating Systems
Georgia Institute of Technology via Udacity
Programmation sur iPhone et iPad (partie I)
Université Pierre et Marie CURIE via France Université Numerique
操作系统原理 (Operating Systems)
Peking University via Coursera
Introduction to Operating Systems
Georgia Institute of Technology via Udacity