
AWQ for LLM Quantization - Efficient Inference Framework for Large Language Models

Offered By: MIT HAN Lab via YouTube

Tags

Quantization Courses, Edge Computing Courses, Model Compression Courses

Course Description

Overview

Explore the Activation-aware Weight Quantization (AWQ) technique for efficient large language model (LLM) deployment in this 21-minute video presentation by MIT HAN Lab. Learn how AWQ tackles the enormous memory footprint of modern LLMs by protecting the small fraction of salient weights, identifying them from activation statistics rather than weight magnitudes and shielding them through per-channel scaling instead of mixed precision. Discover how this hardware-friendly approach outperforms existing methods at preserving LLMs' generalization ability across domains and modalities, including instruction-tuned and multi-modal models. Gain insight into the accompanying inference framework that significantly speeds up LLM deployment on desktop and mobile GPUs, making it possible to run 70B Llama-2 models even on a mobile GPU. Understand the potential of AWQ to democratize access to powerful language models and improve their performance in real-world applications.
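The core mechanism is simple enough to sketch. Below is a minimal, illustrative NumPy version of the activation-aware scaling search described above; it is not the MIT HAN Lab implementation, and the function names (rtn_int4, awq_scale_search), the group size, and the alpha grid are assumptions chosen for readability.

```python
import numpy as np

def rtn_int4(w, group_size=64):
    """Round-to-nearest INT4 quantization with per-group symmetric
    scales, returned in dequantized form for error measurement."""
    out_f, in_f = w.shape
    g = w.reshape(out_f, in_f // group_size, group_size)
    s = np.abs(g).max(axis=-1, keepdims=True) / 7.0 + 1e-12  # INT4 range [-8, 7]
    return (np.clip(np.round(g / s), -8, 7) * s).reshape(out_f, in_f)

def awq_scale_search(w, x, alphas=np.linspace(0.0, 1.0, 11)):
    """Grid-search a per-input-channel scaling exponent alpha.

    Channels with large average activation magnitude are scaled up
    before quantization (and the activations scaled down to match),
    which shrinks their relative quantization error. The best alpha
    minimizes output reconstruction error on calibration data x.
    """
    act_mag = np.abs(x).mean(axis=0)         # per-channel activation statistic
    y_ref = x @ w.T                          # full-precision reference output
    best_s, best_err = np.ones(w.shape[1]), np.inf
    for a in alphas:
        s = act_mag ** a
        s = s / np.sqrt(s.max() * s.min())   # keep scales centered around 1
        wq = rtn_int4(w * s)                 # quantize the scaled weights
        y = (x / s) @ wq.T                   # fold 1/s into the activations
        err = np.mean((y - y_ref) ** 2)
        if err < best_err:
            best_s, best_err = s, err
    return best_s

# Toy usage: channels with larger activations receive larger protective scales.
rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)
x = (rng.normal(size=(128, 256)) * rng.uniform(0.1, 5.0, size=256)).astype(np.float32)
s = awq_scale_search(w, x)
```

Note that the division by s never costs anything at inference time: because it applies per input channel, it can be folded into the preceding layer's output (e.g., a LayerNorm or linear projection), which is what makes the approach hardware-friendly.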

Syllabus

AWQ for LLM Quantization


Taught by

MIT HAN Lab

Related Courses

Quantization Fundamentals with Hugging Face (DeepLearning.AI via Coursera)
Quantization in Depth (DeepLearning.AI via Coursera)
TensorFlow Lite for Edge Devices - Tutorial (freeCodeCamp)
A Gentle Introduction to Sparsity with a Concrete Example (MLOps World: Machine Learning in Production via YouTube)
Applying Second-Order Pruning Algorithms for SOTA Model Compression (Neural Magic via YouTube)