
Collie - Finding Performance Anomalies in RDMA Subsystems

Offered By: USENIX via YouTube

Tags

USENIX Symposium on Networked Systems Design and Implementation (NSDI) Courses
Simulated Annealing Courses
System Administration Courses

Course Description

Overview

Explore a 15-minute conference talk from USENIX NSDI '22 that introduces Collie, a tool designed to uncover performance anomalies in RDMA subsystems. Learn how Collie constructs a comprehensive search space for application workloads and uses simulated annealing to drive RDMA-related performance and diagnostic counters to extreme value regions. Discover the tool's effectiveness in finding 15 new performance anomalies across various RDMA NIC, CPU, and hardware component combinations. Gain insights into the challenges of defining performance anomalies, creating a comprehensive search space, and implementing efficient search algorithms. Understand the importance of hardware counters as search signals and the concept of Minimal Feature Set (MFS) in Collie's approach. Examine the evaluation settings, lessons learned, and future work directions for improving RDMA subsystem performance testing.
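The search idea described above, using simulated annealing to push a counter toward extreme values over a space of workloads, can be illustrated with a minimal sketch. This is not Collie's implementation. The search space dimensions (`message_size`, `num_qps`, `opcode`), the `fake_counter` scoring function, and the cooling schedule are all hypothetical assumptions chosen for illustration. A real tool would run each workload on hardware and read actual counters.

```python
import math
import random

# Hypothetical workload search space (names and values are illustrative assumptions).
SEARCH_SPACE = {
    "message_size": [64, 512, 4096, 65536],
    "num_qps": [1, 8, 64, 512],
    "opcode": ["SEND", "WRITE", "READ"],
}


def fake_counter(workload):
    """Synthetic stand-in for a hardware counter reading (assumption, not real data)."""
    score = {64: 3.0, 512: 2.0, 4096: 1.0, 65536: 0.5}[workload["message_size"]]
    score += math.log2(workload["num_qps"]) / 3.0
    score += {"SEND": 0.0, "WRITE": 0.5, "READ": 2.0}[workload["opcode"]]
    return score


def neighbor(workload, rng):
    """Return a copy of the workload with one randomly chosen dimension changed."""
    key = rng.choice(sorted(SEARCH_SPACE))
    choices = [v for v in SEARCH_SPACE[key] if v != workload[key]]
    mutated = dict(workload)
    mutated[key] = rng.choice(choices)
    return mutated


def anneal(counter, steps=500, t_start=2.0, t_end=0.01, seed=0):
    """Maximize counter(workload) over SEARCH_SPACE with simulated annealing."""
    rng = random.Random(seed)
    current = {k: rng.choice(v) for k, v in sorted(SEARCH_SPACE.items())}
    current_score = counter(current)
    best, best_score = current, current_score
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temperature = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        candidate = neighbor(current, rng)
        candidate_score = counter(candidate)
        delta = candidate_score - current_score
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = candidate, candidate_score
        if current_score > best_score:
            best, best_score = current, current_score
    return best, best_score


if __name__ == "__main__":
    workload, score = anneal(fake_counter)
    print(workload, score)
```

Accepting some worse candidates early, while the temperature is high, lets the search escape local optima. As the temperature falls, the search behaves more like greedy hill climbing.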

Syllabus

Intro
RDMA is being rapidly adopted
There exist unexpected performance anomalies
Existing integration tests
Strawman solutions are not enough
Question: How to define performance anomaly
Challenge #1: Comprehensive Search Space
Challenge #2: Efficient Search Algorithm
Finding the narrow waist
Hardware counters as search signal
Minimal Feature Set (MFS)
Implementation
Evaluation Settings
Lessons and Future Work
Conclusion


Taught by

USENIX

Related Courses

Scaling Memcache at Facebook
USENIX via YouTube
Multi-Person Localization via RF Body Reflections
USENIX via YouTube
Opaque - An Oblivious and Encrypted Distributed Analytics Platform
USENIX via YouTube
Live Video Analytics at Scale with Approximation and Delay-Tolerance
USENIX via YouTube
Clipper - A Low-Latency Online Prediction Serving System
USENIX via YouTube