YoVDO

Fast LLM Serving with vLLM and PagedAttention

Offered By: Anyscale via YouTube

Tags

vLLM Courses, Machine Learning Courses, Deep Learning Courses, Chatbot Courses, Distributed Systems Courses, Transformers Courses

Course Description

Overview

This 32-minute conference talk from Anyscale introduces vLLM, an open-source library for fast LLM inference and serving. It covers the challenges of serving large language models and explains how vLLM, using the PagedAttention algorithm, achieves up to 24x higher throughput than HuggingFace Transformers without requiring changes to the model architecture. PagedAttention manages the attention keys and values effectively, which is how vLLM overcomes the limitations of traditional serving methods. The talk walks through the motivation, features, and implementation of vLLM, which was developed at UC Berkeley and is deployed for Chatbot Arena and the Vicuna Demo, and it closes with future plans for the project. It is aimed at developers and researchers working with LLMs.
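The description credits PagedAttention with managing attention keys and values. The sketch below is a rough illustration of that idea and is not vLLM's actual code. It assumes the publicly described design in which the key/value cache is split into fixed-size blocks and each sequence keeps a table mapping its tokens to physical blocks, so a sequence's cache does not need to be contiguous. The class `ToyPagedKVCache`, its methods, and the block size of 4 are hypothetical names and values chosen for the example. It tracks block ids only and stores no tensors.

```python
class ToyPagedKVCache:
    """Toy bookkeeping for a paged KV cache (hypothetical, not vLLM's API)."""

    def __init__(self, num_blocks, block_size=4):
        self.block_size = block_size                # tokens per block (illustrative)
        self.free_blocks = list(range(num_blocks))  # physical block ids not in use
        self.block_tables = {}                      # seq_id -> list of physical block ids
        self.seq_lens = {}                          # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token and return (physical_block, offset)."""
        length = self.seq_lens.get(seq_id, 0)
        # A new block is needed only when the sequence has no block yet
        # or its last block is completely full.
        if length % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop(0))
        self.seq_lens[seq_id] = length + 1
        return self.block_tables[seq_id][-1], length % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


# Two interleaved sequences end up sharing the pool without contiguous ranges.
cache = ToyPagedKVCache(num_blocks=3)
for _ in range(4):
    cache.append_token("a")
for _ in range(3):
    cache.append_token("b")
cache.append_token("a")
print(cache.block_tables)
```

Because blocks are handed out on demand, a sequence wastes at most one partially filled block, and blocks released by a finished request can be reused by any other request.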

Syllabus

Fast LLM Serving with vLLM and PagedAttention


Taught by

Anyscale

Related Courses

Linear Circuits
Georgia Institute of Technology via Coursera
Introduction to Energy and Power Engineering (مقدمة في هندسة الطاقة والقوى)
King Abdulaziz University via Rwaq (رواق)
Magnetic Materials and Devices
Massachusetts Institute of Technology via edX
Linear Circuits 2: AC Analysis
Georgia Institute of Technology via Coursera
Electric Power Transmission (Transmisión de energía eléctrica)
Tecnológico de Monterrey via edX