Quant-LLM: Accelerating Large Language Model Serving via FP6-Centric Algorithm-System Co-Design
Offered By: USENIX via YouTube
Course Description
Overview
Explore a conference talk that presents Quant-LLM, an approach to accelerating large language model serving through FP6-centric algorithm-system co-design on modern GPUs. Learn about the challenges of supporting FP6 quantization on GPUs and discover TC-FPx, the first full-stack GPU kernel design scheme with unified Tensor Core support for 6-bit and other irregular bit-width quantization. Understand how Quant-LLM integrates the TC-FPx kernel into existing inference systems, providing new end-to-end support for quantized LLM inference and a better trade-off between inference cost and model quality. Examine experimental results showing that Quant-LLM enables LLaMA-70b inference on a single GPU with significantly higher normalized inference throughput than the FP16 baseline. Access the publicly available source code and gain insights into how 6-bit quantization can reduce LLM size while preserving model quality across a range of applications.
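The exact FP6 format and the TC-FPx kernel design are defined in the talk and paper; as a rough, hypothetical illustration of what 6-bit float quantization means, the Python sketch below rounds full-precision weights to the nearest value representable in an assumed sign + 3-bit-exponent + 2-bit-mantissa (E3M2) layout. The function names and the E3M2 choice are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

# Assumed FP6 layout for this sketch: 1 sign bit, 3 exponent bits,
# 2 mantissa bits (E3M2). The paper defines the actual format used.
EXP_BITS, MAN_BITS = 3, 2
EXP_BIAS = 2 ** (EXP_BITS - 1) - 1  # standard bias convention: 3 for E3M2

def fp6_representable_values():
    """Enumerate all non-negative magnitudes an E3M2 FP6 number can encode."""
    vals = set()
    for e in range(2 ** EXP_BITS):
        for m in range(2 ** MAN_BITS):
            if e == 0:  # subnormals: no implicit leading 1 (includes 0.0)
                vals.add(m * 2.0 ** (1 - EXP_BIAS - MAN_BITS))
            else:       # normals: implicit leading 1
                vals.add((1 + m / 2 ** MAN_BITS) * 2.0 ** (e - EXP_BIAS))
    return np.array(sorted(vals))

def quantize_to_fp6(x):
    """Round each weight to the nearest representable FP6 magnitude, keeping its sign."""
    grid = fp6_representable_values()
    mags = np.abs(np.asarray(x, dtype=np.float32))
    idx = np.argmin(np.abs(mags[..., None] - grid), axis=-1)
    return np.sign(x) * grid[idx]

weights = np.random.randn(8).astype(np.float32)
print(quantize_to_fp6(weights))  # FP6-rounded copies of the weights
```

In a real serving system the rounded values would be bit-packed into 6-bit fields and dequantized inside the GPU kernel; handling that irregular bit width efficiently on Tensor Cores is precisely the problem TC-FPx addresses.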
Syllabus
USENIX ATC '24 - Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs
Taught by
USENIX
Related Courses
LLaMA - Open and Efficient Foundation Language Models - Paper Explained (Yannic Kilcher via YouTube)
Alpaca & LLaMA - Can it Compete with ChatGPT? (Venelin Valkov via YouTube)
Experimenting with Alpaca & LLaMA (Aladdin Persson via YouTube)
What's LLaMA? ChatLLaMA? - And Some ChatGPT/InstructGPT (Aladdin Persson via YouTube)
Llama Index - Step by Step Introduction (echohive via YouTube)