
Double Inference Speed with AWQ Quantization

Offered By: Trelis Research via YouTube

Tags

Quantization, Model Deployment, Language Models, Text Generation

Course Description

Overview

Learn how to double inference speed with AWQ (Activation-aware Weight Quantization) in this 23-minute tutorial from Trelis Research. The video shows how to deploy a Llama 2 70B server and API with AWQ, set up chat-ui for Llama 2, and run AWQ in Google Colab. It then explains how AWQ quantization works, how quantization works for language models in general, and how AWQ compares with GPTQ. You also get practical advice on improving inference speed and accuracy, plus pro tips for implementation. Supporting resources include a Colab notebook, research papers, GitHub repositories, and links to the models used.

Syllabus

Increase inference speed and accuracy with AWQ
Deploy a Llama 2 70B server and API with AWQ
How to set up chat-ui for Llama 2
How to run AWQ in Google Colab
How does AWQ quantization work?
How does quantization work for language models?
How does GPTQ work?
Pro tips
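The syllabus item on how quantization works for language models can be illustrated with a minimal sketch. The code below shows generic symmetric round-to-nearest 4-bit quantization, where each group of weights shares one floating-point scale. It is an illustrative assumption, not the AWQ or GPTQ procedure covered in the video: AWQ additionally rescales salient weight channels based on activation statistics before quantizing, which this sketch omits. The function names and the group size of 4 are hypothetical choices for readability.

```python
def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest quantization of one weight group (illustrative only)."""
    qmax = 2 ** (bits - 1) - 1  # largest signed level, e.g. 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # one float scale per group
    # Round each weight to the nearest integer level and clamp to the signed range.
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize_group(q, scale):
    """Map integer levels back to approximate float weights."""
    return [level * scale for level in q]


def quantize_rows(weights, group_size=4, bits=4):
    """Split a flat weight list into groups, each quantized with its own scale (hypothetical helper)."""
    groups = []
    for start in range(0, len(weights), group_size):
        groups.append(quantize_group(weights[start:start + group_size], bits))
    return groups


if __name__ == "__main__":
    w = [0.12, -0.5, 0.33, 0.07, 1.2, -0.9, 0.05, 0.62]
    for q, scale in quantize_rows(w):
        print(q, round(scale, 4), dequantize_group(q, scale))
```

Giving each group its own scale keeps one large weight from coarsening the resolution of every other weight in the layer. Real 4-bit inference kernels also pack two levels into each byte, which this sketch does not do.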


Taught by

Trelis Research
