DistServe: Disaggregating Prefill and Decoding for Goodput-Optimized Large Language Model Serving
Offered By: USENIX via YouTube
Course Description
Overview
Explore a 15-minute conference talk from USENIX OSDI '24 that introduces DistServe, a novel approach to improving large language model (LLM) serving performance. Learn how DistServe disaggregates prefill and decoding computation, assigning them to different GPUs to eliminate interference and optimize resource allocation for each phase. Discover how this method significantly improves LLM serving performance by meeting stringent latency requirements for both time to first token (TTFT) and time per output token (TPOT). Understand the benefits of DistServe's co-optimization strategy and its ability to serve up to 7.4 times more requests, or meet 12.6 times tighter SLOs, than state-of-the-art systems, while keeping over 90% of requests within latency constraints across various popular LLMs and applications.
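The "goodput" metric the talk optimizes can be made concrete with a small sketch: it counts only those requests that meet both latency SLOs, TTFT and TPOT. The numbers and helper names below are illustrative assumptions for this sketch, not values or code from the paper.

```python
# Hypothetical sketch of goodput as the talk describes it: the rate of
# requests that satisfy BOTH SLOs -- TTFT (time to first token, dominated
# by prefill) and TPOT (time per output token, dominated by decoding).
# All latencies and SLO values here are made up for illustration.

from dataclasses import dataclass

@dataclass
class Request:
    ttft_ms: float   # observed time to first token
    tpot_ms: float   # observed average time per output token

def goodput(requests, ttft_slo_ms, tpot_slo_ms, window_s):
    """Requests per second that meet both SLOs within a measurement window."""
    good = sum(1 for r in requests
               if r.ttft_ms <= ttft_slo_ms and r.tpot_ms <= tpot_slo_ms)
    return good / window_s

# Example: 4 requests observed over a 2-second window,
# with SLOs of 200 ms TTFT and 50 ms TPOT.
reqs = [
    Request(ttft_ms=150, tpot_ms=40),   # meets both SLOs
    Request(ttft_ms=250, tpot_ms=30),   # misses TTFT
    Request(ttft_ms=180, tpot_ms=60),   # misses TPOT
    Request(ttft_ms=120, tpot_ms=45),   # meets both SLOs
]
print(goodput(reqs, ttft_slo_ms=200, tpot_slo_ms=50, window_s=2.0))  # → 1.0
```

This is why disaggregation helps: colocating prefill and decoding on one GPU lets a long prefill inflate other requests' TPOT (and vice versa), so fewer requests clear both thresholds even when raw throughput is unchanged.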
Syllabus
OSDI '24 - DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
Taught by
USENIX
Related Courses
Intro to Parallel Programming - Nvidia via Udacity
Introduction to Linear Models and Matrix Algebra - Harvard University via edX
Introduction to Parallel Programming Using OpenMP and MPI - Tomsk State University via Coursera
Supercomputing - Partnership for Advanced Computing in Europe via FutureLearn
Fundamentals of Parallelism on Intel Architecture - Intel via Coursera