dLoRA - Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving
Offered By: USENIX via YouTube
Course Description
Overview
Explore a conference talk on dLoRA, an inference serving system for LoRA (Low-Rank Adaptation) models in large language model (LLM) serving. Delve into the dynamic orchestration of requests and LoRA adapters, focusing on two key capabilities: dynamically merging and unmerging adapters with the base model, and migrating requests and adapters between worker replicas. Discover the insights behind these capabilities, including how request skewness affects adapter-merging decisions and how the varying input and output lengths of autoregressive LLM requests cause load imbalance. Learn about the credit-based batching algorithm for merge/unmerge decisions and the request-adapter co-migration algorithm. Examine the performance improvements achieved by dLoRA: throughput increases of up to 57.9× and 26.0× compared to vLLM and Hugging Face PEFT, respectively, and up to 1.8× lower average latency than the concurrent work S-LoRA.
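To make the merge/unmerge idea concrete, here is a minimal sketch (not dLoRA's actual implementation; all names and the toy matrices are illustrative) of the two serving modes a LoRA system can toggle between: folding the low-rank adapter into the base weight versus applying it as a separate low-rank path.

```python
# Hypothetical sketch of LoRA adapter merging/unmerging, using plain
# Python lists as tiny matrices for self-containment.

def matmul(X, Y):
    """Naive matrix multiply of nested-list matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def matadd(X, Y):
    """Elementwise matrix addition."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

# Toy base weight W (2x2) and a rank-1 LoRA adapter (B: 2x1, A: 1x2).
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[0.5], [0.25]]
A = [[2.0, 4.0]]

# Merged mode: fold the adapter into the base weight (W' = W + B @ A),
# so serving this adapter costs no extra matmul per token -- but the
# replica can then only serve this one adapter's requests.
W_merged = matadd(W, matmul(B, A))

# Unmerged mode: keep W intact and apply the low-rank path per request,
# which lets one batch mix requests for many different adapters.
x = [[3.0], [7.0]]
y_merged = matmul(W_merged, x)
y_unmerged = matadd(matmul(W, x), matmul(B, matmul(A, x)))

assert y_merged == y_unmerged  # both modes compute the same function
```

The throughput trade-off between these two modes under skewed request distributions is what dLoRA's credit-based batching algorithm navigates.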
Syllabus
OSDI '24 - dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving
Taught by
USENIX