Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention

Offered By: USENIX via YouTube

Tags

Attention Mechanisms
Positional Encoding

Course Description

Overview

Explore a cutting-edge approach to optimizing large language model (LLM) serving for multi-turn conversations in this 22-minute conference talk from USENIX ATC '24. Dive into CachedAttention, a new attention mechanism designed to significantly reduce the computational overheads and serving costs of repeated LLM interactions. Learn how it enables the reuse of key-value (KV) caches across the turns of a conversation, employing a hierarchical caching system and intelligent scheduling techniques. Discover strategies for efficient KV cache management, including layer-wise pre-loading, asynchronous saving, and scheduler-aware fetching and eviction schemes. Understand how CachedAttention handles context window overflow while keeping saved KV caches valid. Examine the experimental results, which show substantial reductions in time to first token (TTFT), higher prompt prefilling throughput, and lower overall inference cost for multi-turn conversations with LLMs.
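To make the mechanics concrete, here is a minimal sketch in Python of the core idea: after each turn the conversation's per-layer KV cache is saved asynchronously, and on the next turn it is reloaded layer by layer so that only the new tokens need prefilling. All names here (KVCacheStore, prefill_turn, save_async, load_layer) are illustrative, not the paper's API; the real system manages key/value tensors across an HBM/DRAM/disk hierarchy, and its fetching and eviction decisions come from the inference job scheduler rather than the plain LRU used below.

```python
import threading
from collections import OrderedDict

class KVCacheStore:
    """Hypothetical two-tier KV cache store. The 'fast' tier stands in for
    GPU/host memory and the 'slow' tier for disk."""

    def __init__(self, fast_capacity=4):
        self.fast = OrderedDict()      # conv_id -> per-layer KV (LRU order)
        self.slow = {}                 # overflow tier
        self.fast_capacity = fast_capacity
        self.pending = {}              # conv_id -> in-flight save thread
        self.lock = threading.Lock()

    def save_async(self, conv_id, kv_per_layer):
        # Asynchronous saving: persist the KV cache off the critical path
        # so token generation is not blocked on the write.
        t = threading.Thread(target=self._save, args=(conv_id, kv_per_layer))
        self.pending[conv_id] = t
        t.start()

    def _save(self, conv_id, kv_per_layer):
        with self.lock:
            self.fast[conv_id] = kv_per_layer
            self.fast.move_to_end(conv_id)
            while len(self.fast) > self.fast_capacity:
                # LRU eviction demotes to the slow tier instead of discarding;
                # the real scheduler-aware scheme uses the job queue instead.
                old_id, old_kv = self.fast.popitem(last=False)
                self.slow[old_id] = old_kv

    def load_layer(self, conv_id, layer):
        # Layer-wise pre-loading: fetch one layer's KV at a time, so loading
        # layer L+1 can overlap with attention computation at layer L.
        t = self.pending.pop(conv_id, None)
        if t is not None:
            t.join()                   # wait for any in-flight save to land
        with self.lock:
            kv = self.fast.get(conv_id) or self.slow.get(conv_id)
        return None if kv is None else kv[layer]


def prefill_turn(store, conv_id, new_tokens, num_layers=4):
    """Prefill only the new turn's tokens, reusing saved KV for the history."""
    kv_per_layer = []
    for layer in range(num_layers):
        past = store.load_layer(conv_id, layer)    # None on the first turn
        # Stand-in for attention: the 'KV' here is just the token list; a
        # real engine would append key/value tensors for the new tokens.
        kv_per_layer.append((past or []) + list(new_tokens))
    store.save_async(conv_id, kv_per_layer)
    return kv_per_layer


store = KVCacheStore()
prefill_turn(store, "conv-1", ["Hello!"])                   # turn 1: full prefill
kv = prefill_turn(store, "conv-1", ["How", "are", "you?"])  # turn 2: reuses turn-1 KV
print(kv[0])  # ['Hello!', 'How', 'are', 'you?']
```

In this toy version, turn 2 skips recomputing the KV for "Hello!" entirely; that skipped recomputation is exactly what drives the TTFT and prefilling-throughput gains reported in the talk.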

Syllabus

USENIX ATC '24 - Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention


Taught by

USENIX

Related Courses

Attention Mechanism
Google Cloud via Coursera
Attention Mechanism
Google via Google Cloud Skills Boost
Attention Mechanism - Italiano
Google Cloud via Coursera
Attention Mechanism - 한국어
Google Cloud via Coursera
Attention Mechanism - Português Brasileiro
Google Cloud via Coursera