Accelerating LLM Inference with vLLM
Offered By: Databricks via YouTube
Course Description
Overview
Explore cutting-edge advances in LLM inference performance in this 36-minute conference talk by Cade Daniel and Zhuohan Li. Dive into vLLM, an open-source inference engine developed at UC Berkeley that has transformed LLM inference and serving. Learn about its key performance-enhancing techniques, PagedAttention and continuous batching. Discover recent additions to vLLM, including speculative decoding, prefix caching, disaggregated prefill, and multi-accelerator support. Gain insights from industry case studies and get a glimpse of vLLM's future roadmap. Understand how vLLM's focus on production readiness and extensibility has led to new systems insights and widespread community adoption, making it a state-of-the-art, accelerator-agnostic solution for LLM inference.
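For a concrete starting point before watching, here is a minimal sketch of offline inference with vLLM's Python API. The model checkpoint, prompts, and sampling settings below are illustrative assumptions, not details from the talk; PagedAttention and continuous batching run transparently inside the engine.

```python
# Minimal sketch of offline inference with vLLM (pip install vllm).
# Model name and sampling settings are illustrative assumptions;
# swap in any supported Hugging Face checkpoint.
from vllm import LLM, SamplingParams

prompts = [
    "Explain paged attention in one sentence:",
    "Why does continuous batching raise GPU utilization?",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# The engine manages the KV cache in fixed-size blocks (PagedAttention)
# and schedules requests with continuous batching under the hood.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```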
Syllabus
Accelerating LLM Inference with vLLM
Taught by
Databricks
Related Courses
Finetuning, Serving, and Evaluating Large Language Models in the Wild (Open Data Science via YouTube)
Cloud Native Sustainable LLM Inference in Action (CNCF [Cloud Native Computing Foundation] via YouTube)
Optimizing Kubernetes Cluster Scaling for Advanced Generative Models (Linux Foundation via YouTube)
LLaMa for Developers (LinkedIn Learning)
Scaling Video Ad Classification Across Millions of Classes with GenAI (Databricks via YouTube)