Accelerating LLM Inference with vLLM
Offered By: Databricks via YouTube
Course Description
Overview
Explore cutting-edge advances in LLM inference performance in this 36-minute conference talk by Cade Daniel and Zhuohan Li. Dive into vLLM, an open-source inference engine developed at UC Berkeley that has transformed LLM inference and serving. Learn about its key performance-enhancing techniques, PagedAttention and continuous batching. Discover recent additions to vLLM, including speculative decoding, prefix caching, disaggregated prefill, and multi-accelerator support. Gain insights from industry case studies and get a glimpse of vLLM's future roadmap. Understand how vLLM's focus on production readiness and extensibility has led to new systems insights and widespread community adoption, making it a state-of-the-art, accelerator-agnostic solution for LLM inference.
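For a concrete starting point before watching, here is a minimal sketch of offline inference with vLLM's Python API. The model checkpoint, prompts, and sampling settings below are illustrative assumptions, not details from the talk; PagedAttention and continuous batching run transparently inside the engine.

```python
# Minimal sketch of offline inference with vLLM (pip install vllm).
# Model name and sampling settings are illustrative assumptions;
# swap in any supported Hugging Face checkpoint.
from vllm import LLM, SamplingParams

prompts = [
    "Explain paged attention in one sentence:",
    "Why does continuous batching raise GPU utilization?",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# The engine manages the KV cache in fixed-size blocks (PagedAttention)
# and schedules requests with continuous batching under the hood.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```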
Syllabus
Accelerating LLM Inference with vLLM
Taught by
Databricks
Related Courses
Finetuning, Serving, and Evaluating Large Language Models in the Wild (Open Data Science via YouTube)
Cloud Native Sustainable LLM Inference in Action (CNCF [Cloud Native Computing Foundation] via YouTube)
Optimizing Kubernetes Cluster Scaling for Advanced Generative Models (Linux Foundation via YouTube)
LLaMa for Developers (LinkedIn Learning)
Scaling Video Ad Classification Across Millions of Classes with GenAI (Databricks via YouTube)