DistServe: Disaggregating Prefill and Decoding for Goodput-Optimized Large Language Model Serving
Offered By: USENIX via YouTube
Course Description
Overview
Explore a 15-minute conference talk from USENIX OSDI '24 that introduces DistServe, a novel approach to improving large language model (LLM) serving performance. Learn how DistServe disaggregates prefill and decoding computation, assigning them to different GPUs to eliminate interference and optimize resource allocation for each phase. Discover how this method significantly improves LLM serving performance by meeting stringent latency requirements for both time to first token (TTFT) and time per output token (TPOT). Understand the benefits of DistServe's co-optimization strategy and its ability to serve up to 7.4 times more requests, or meet 12.6 times tighter SLOs, than state-of-the-art systems, while keeping over 90% of requests within latency constraints across various popular LLMs and applications.
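The "goodput" metric the talk optimizes can be made concrete with a small sketch: it counts only those requests that meet both latency SLOs, TTFT and TPOT. The numbers and helper names below are illustrative assumptions for this sketch, not values or code from the paper.

```python
# Hypothetical sketch of goodput as the talk describes it: the rate of
# requests that satisfy BOTH SLOs -- TTFT (time to first token, dominated
# by prefill) and TPOT (time per output token, dominated by decoding).
# All latencies and SLO values here are made up for illustration.

from dataclasses import dataclass

@dataclass
class Request:
    ttft_ms: float   # observed time to first token
    tpot_ms: float   # observed average time per output token

def goodput(requests, ttft_slo_ms, tpot_slo_ms, window_s):
    """Requests per second that meet both SLOs within a measurement window."""
    good = sum(1 for r in requests
               if r.ttft_ms <= ttft_slo_ms and r.tpot_ms <= tpot_slo_ms)
    return good / window_s

# Example: 4 requests observed over a 2-second window,
# with SLOs of 200 ms TTFT and 50 ms TPOT.
reqs = [
    Request(ttft_ms=150, tpot_ms=40),   # meets both SLOs
    Request(ttft_ms=250, tpot_ms=30),   # misses TTFT
    Request(ttft_ms=180, tpot_ms=60),   # misses TPOT
    Request(ttft_ms=120, tpot_ms=45),   # meets both SLOs
]
print(goodput(reqs, ttft_slo_ms=200, tpot_slo_ms=50, window_s=2.0))  # → 1.0
```

This is why disaggregation helps: colocating prefill and decoding on one GPU lets a long prefill inflate other requests' TPOT (and vice versa), so fewer requests clear both thresholds even when raw throughput is unchanged.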
Syllabus
OSDI '24 - DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
Taught by
USENIX
Related Courses
Intro to Parallel Programming - Nvidia via Udacity
Introduction to Linear Models and Matrix Algebra - Harvard University via edX
Introduction to Parallel Programming Using OpenMP and MPI - Tomsk State University via Coursera
Supercomputing - Partnership for Advanced Computing in Europe via FutureLearn
Fundamentals of Parallelism on Intel Architecture - Intel via Coursera