Databricks' vLLM Optimization for Cost-Effective LLM Inference - Ray Summit 2024
Offered By: Anyscale via YouTube
Course Description
Overview
This Ray Summit 2024 presentation covers how Databricks optimizes vLLM for faster, more cost-effective LLM inference. Megha Agarwal and her team at Databricks (MosaicML) explain how operations that block the GPU during decoding steps can significantly degrade performance for large models, and present their solutions: reducing GPU idle time and accelerating quantization with custom kernels. The talk also covers future optimization areas and best practices for benchmarking LLM deployments. It is aimed at organizations and developers working on large-scale LLM projects who want practical strategies to improve inference efficiency and reduce the cost of LLM serving products.
Syllabus
Databricks' vLLM Optimization for Cost-Effective LLM Inference | Ray Summit 2024
Taught by
Anyscale
Related Courses
Introduction to Artificial Intelligence - Stanford University via Udacity
Natural Language Processing - Columbia University via Coursera
Probabilistic Graphical Models 1: Representation - Stanford University via Coursera
Computer Vision: The Fundamentals - University of California, Berkeley via Coursera
Learning from Data (Introductory Machine Learning course) - California Institute of Technology via Independent