Fast LLM Serving with vLLM and PagedAttention
Offered By: Anyscale via YouTube
Course Description
Overview
Explore the innovative vLLM open-source library for fast LLM inference and serving in this 32-minute conference talk by Anyscale. Dive into the challenges of serving large language models and discover how vLLM, equipped with the novel PagedAttention algorithm, achieves up to 24x higher throughput than HuggingFace Transformers without requiring model architecture changes. Learn about the motivation, features, and implementation of vLLM, developed at UC Berkeley and deployed for Chatbot Arena and Vicuna Demo. Gain insights into the future plans for this groundbreaking technology that promises to revolutionize AI usage across industries. Understand how vLLM effectively manages attention keys and values to overcome the limitations of traditional serving methods, making it an essential tool for developers and researchers working with LLMs.
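The talk's central technique, PagedAttention, stores each sequence's attention keys and values in fixed-size blocks that need not be contiguous in memory, tracked by a per-sequence block table (analogous to OS virtual-memory paging). The following is a minimal illustrative sketch of that bookkeeping idea, not vLLM's actual implementation; the class names, block size, and pool size are assumptions chosen for clarity.

```python
# Illustrative sketch (not vLLM's real code) of PagedAttention's core idea:
# KV-cache memory is split into fixed-size physical blocks, and each
# sequence maps its logical token positions to physical blocks on demand.

BLOCK_SIZE = 4  # tokens per KV-cache block (hypothetical; chosen for the demo)

class BlockManager:
    """Pool of free physical KV-cache blocks shared by all sequences."""
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))

    def allocate(self):
        # Hand out any free block; physical contiguity is not required.
        return self.free_blocks.pop()

class Sequence:
    """Tracks one request's block table: logical block index -> physical block id."""
    def __init__(self, manager):
        self.manager = manager
        self.block_table = []
        self.num_tokens = 0

    def append_token(self):
        # A new physical block is allocated only when the last one is full,
        # so at most BLOCK_SIZE - 1 slots per sequence are ever wasted,
        # unlike contiguous preallocation for the maximum sequence length.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.manager.allocate())
        self.num_tokens += 1

manager = BlockManager(num_blocks=8)
seq = Sequence(manager)
for _ in range(10):
    seq.append_token()

# 10 tokens at 4 tokens/block occupy ceil(10 / 4) = 3 blocks.
print(len(seq.block_table))      # 3
print(len(manager.free_blocks))  # 5
```

Because blocks are allocated lazily and can live anywhere in the pool, memory that would otherwise be reserved for unfilled sequence slots stays available for other concurrent requests, which is the source of the throughput gains described above.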
Syllabus
Fast LLM Serving with vLLM and PagedAttention
Taught by
Anyscale
Related Courses
Finetuning, Serving, and Evaluating Large Language Models in the Wild - Open Data Science via YouTube
Cloud Native Sustainable LLM Inference in Action - CNCF [Cloud Native Computing Foundation] via YouTube
Optimizing Kubernetes Cluster Scaling for Advanced Generative Models - Linux Foundation via YouTube
LLaMa for Developers - LinkedIn Learning
Scaling Video Ad Classification Across Millions of Classes with GenAI - Databricks via YouTube