Quant-LLM: Accelerating Large Language Model Serving via FP6-Centric Algorithm-System Co-Design
Offered By: USENIX via YouTube
Course Description
Overview
Explore a conference talk that presents Quant-LLM, an approach to accelerating large language model serving through FP6-centric algorithm-system co-design on modern GPUs. Learn about the challenges of supporting FP6 quantization on GPUs and discover TC-FPx, the first full-stack GPU kernel design scheme with unified Tensor Core support for 6-bit and other irregular bit-width quantization. Understand how Quant-LLM integrates the TC-FPx kernel into existing inference systems, providing new end-to-end support for quantized LLM inference and a better trade-off between inference cost and model quality. Examine experimental results showing that Quant-LLM enables LLaMA-70b inference on a single GPU with significantly higher normalized inference throughput than the FP16 baseline. Access the publicly available source code and gain insights into how 6-bit quantization can reduce LLM size while preserving model quality across a range of applications.
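The exact FP6 format and the TC-FPx kernel design are defined in the talk and paper; as a rough, hypothetical illustration of what 6-bit float quantization means, the Python sketch below rounds full-precision weights to the nearest value representable in an assumed sign + 3-bit-exponent + 2-bit-mantissa (E3M2) layout. The function names and the E3M2 choice are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

# Assumed FP6 layout for this sketch: 1 sign bit, 3 exponent bits,
# 2 mantissa bits (E3M2). The paper defines the actual format used.
EXP_BITS, MAN_BITS = 3, 2
EXP_BIAS = 2 ** (EXP_BITS - 1) - 1  # standard bias convention: 3 for E3M2

def fp6_representable_values():
    """Enumerate all non-negative magnitudes an E3M2 FP6 number can encode."""
    vals = set()
    for e in range(2 ** EXP_BITS):
        for m in range(2 ** MAN_BITS):
            if e == 0:  # subnormals: no implicit leading 1 (includes 0.0)
                vals.add(m * 2.0 ** (1 - EXP_BIAS - MAN_BITS))
            else:       # normals: implicit leading 1
                vals.add((1 + m / 2 ** MAN_BITS) * 2.0 ** (e - EXP_BIAS))
    return np.array(sorted(vals))

def quantize_to_fp6(x):
    """Round each weight to the nearest representable FP6 magnitude, keeping its sign."""
    grid = fp6_representable_values()
    mags = np.abs(np.asarray(x, dtype=np.float32))
    idx = np.argmin(np.abs(mags[..., None] - grid), axis=-1)
    return np.sign(x) * grid[idx]

weights = np.random.randn(8).astype(np.float32)
print(quantize_to_fp6(weights))  # FP6-rounded copies of the weights
```

In a real serving system the rounded values would be bit-packed into 6-bit fields and dequantized inside the GPU kernel; handling that irregular bit width efficiently on Tensor Cores is precisely the problem TC-FPx addresses.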
Syllabus
USENIX ATC '24 - Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs
Taught by
USENIX
Related Courses
LLaMA - Open and Efficient Foundation Language Models - Paper Explained (Yannic Kilcher via YouTube)
Alpaca & LLaMA - Can it Compete with ChatGPT? (Venelin Valkov via YouTube)
Experimenting with Alpaca & LLaMA (Aladdin Persson via YouTube)
What's LLaMA? ChatLLaMA? - And Some ChatGPT/InstructGPT (Aladdin Persson via YouTube)
Llama Index - Step by Step Introduction (echohive via YouTube)