YoVDO

DistServe: Disaggregating Prefill and Decoding for Goodput-Optimized Large Language Model Serving

Offered By: USENIX via YouTube

Tags

Parallel Computing Courses

Course Description

Overview

Explore a 15-minute conference talk from USENIX OSDI '24 that introduces DistServe, a system for improving large language model (LLM) serving performance. Learn how DistServe disaggregates prefill and decoding computation, assigning each phase to different GPUs to eliminate interference between them and to let resource allocation and parallelism be tailored to each phase. Discover how this approach meets stringent latency requirements for both time to first token (TTFT) and time per output token (TPOT). Understand how DistServe's co-optimization strategy allows it to serve up to 7.4 times more requests, or meet a 12.6 times tighter SLO, than state-of-the-art systems, while keeping over 90% of requests within their latency constraints across various popular LLMs and applications.
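The disaggregation idea described above can be illustrated with a minimal sketch: prompts first pass through a prefill stage (compute-bound, determines TTFT), then move to a separate decode stage (memory-bound, determines TPOT), so neither phase interferes with the other. This is a hypothetical toy model, not DistServe's actual implementation; all names (`Request`, `prefill`, `decode_step`, `serve`) are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    tokens: list = field(default_factory=list)

def prefill(req: Request) -> Request:
    # Compute-bound phase: process the whole prompt at once and emit
    # the first token (this is where TTFT is measured). In DistServe,
    # this runs on dedicated prefill GPUs, and the resulting KV cache
    # is transferred to a decode GPU.
    req.tokens.append("<tok0>")
    return req

def decode_step(req: Request) -> None:
    # Memory-bound phase: generate one token per step (TPOT per step).
    req.tokens.append(f"<tok{len(req.tokens)}>")

def serve(requests: list) -> list:
    # Stage 1: the "prefill pool" handles incoming prompts.
    prefilled = [prefill(r) for r in requests]
    # Stage 2: the "decode pool" iterates autoregressively, never
    # competing with new prefills for the same GPU.
    for r in prefilled:
        while len(r.tokens) < r.max_new_tokens:
            decode_step(r)
    return prefilled

results = serve([Request("hello world", max_new_tokens=4)])
```

Because the two pools are physically separate, each can be provisioned and parallelized for its own latency target, which is the core of the goodput optimization the talk describes.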

Syllabus

OSDI '24 - DistServe: Disaggregating Prefill and Decoding for Goodput-Optimized Large Language Model Serving


Taught by

USENIX

Related Courses

Intro to Parallel Programming
Nvidia via Udacity
Introduction to Linear Models and Matrix Algebra
Harvard University via edX
Introduction to Parallel Programming Using OpenMP and MPI
Tomsk State University via Coursera
Supercomputing
Partnership for Advanced Computing in Europe via FutureLearn
Fundamentals of Parallelism on Intel Architecture
Intel via Coursera