How to Pick a GPU and Inference Engine for Large Language Models

Offered By: Trelis Research via YouTube

Tags

Model Optimization Courses, GPT-4 Courses, LLaMA (Large Language Model Meta AI) Courses, Quantization Courses, vLLM Courses, NVIDIA NIM Courses

Course Description

Overview

Dive into a comprehensive video tutorial on selecting the right GPU and inference engine for serving large language models. Learn how quantization affects model quality and speed, how GPU memory bandwidth relates to model size, and how de-quantization impacts inference speed. Explore advanced topics such as Marlin kernels, AWQ, and Neural Magic. Compare popular inference engines, including vLLM, TGI, SGLang, and NVIDIA NIM, and see how to deploy one-click templates for inference. Analyze detailed performance comparisons across GPUs (A40, A6000, A100, H100) and model sizes (Llama 3.1 8B, 70B, 405B), including cost considerations, and compare OpenAI GPT-4o inference costs with Llama models. Conclude with practical tips for optimizing inference setups and additional resources for further learning.
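The bandwidth/model-size relationship mentioned above can be illustrated with a back-of-the-envelope calculation: at batch size 1, every generated token requires reading all model weights from GPU memory, so decode throughput is capped by memory bandwidth divided by model size. A minimal sketch (bandwidth figures are the GPUs' published peak specs; the formula ignores batching, KV-cache traffic, and compute limits, so real numbers will be lower):

```python
# Bandwidth-bound ceiling on decode speed at batch size 1:
# tokens/s <= memory bandwidth / model size in bytes.

GPU_BANDWIDTH_GBPS = {  # published peak memory bandwidth, GB/s
    "A40": 696,
    "A6000": 768,
    "A100-80GB": 2039,
    "H100-SXM": 3350,
}

def max_tokens_per_second(params_billion: float, bytes_per_param: float, gpu: str) -> float:
    """Upper bound on single-request decode throughput (tokens/s)."""
    model_gb = params_billion * bytes_per_param
    return GPU_BANDWIDTH_GBPS[gpu] / model_gb

# Llama 3.1 8B in fp16 (2 bytes/param) vs int4 (~0.5 bytes/param) on an A100:
fp16 = max_tokens_per_second(8, 2.0, "A100-80GB")
int4 = max_tokens_per_second(8, 0.5, "A100-80GB")
print(f"fp16 ceiling: {fp16:.0f} tok/s, int4 ceiling: {int4:.0f} tok/s")
```

This is why quantization speeds up small-batch inference even though weights must be de-quantized before each matrix multiply: the dominant cost is moving bytes, not arithmetic.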

Syllabus

How to pick a GPU and software for inference
Video Overview
Effect of Quantization on Quality
Effect of Quantization on Speed
Effect of GPU bandwidth relative to model size
Effect of de-quantization on inference speed
Marlin Kernels, AWQ and Neural Magic
Inference Software - vLLM, TGI, SGLang, NIM
Deploying one-click templates for inference
Testing inference speed for a batch size of 1 and 64
SGLang inference speed
vLLM inference speed
Text Generation Inference Speed
NVIDIA NIM Inference Speed
Comparing vLLM, SGLang, TGI and NIM Inference Speed
Comparing inference costs for A40, A6000, A100 and H100
Inference Setup for Llama 3.1 70B and 405B
Running inference on Llama 8B on A40, A6000, A100 and H100
Inference cost comparison for Llama 8B
Running inference on Llama 70B and 405B on A40, A6000, A100 and H100
Inference cost comparison for Llama 70B and 405B
OpenAI GPT-4o Inference Costs versus Llama 3.1 8B, 70B, 405B
Final Inference Tips
Resources
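The cost-comparison chapters in the syllabus reduce to one piece of arithmetic: dollars per hour of GPU rental divided by tokens generated per hour. A minimal sketch (the rental price and throughput below are hypothetical placeholders, not measurements from the video):

```python
def cost_per_million_tokens(gpu_usd_per_hour: float, tokens_per_second: float) -> float:
    """$ per 1M generated tokens when renting a GPU at a given sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_usd_per_hour / tokens_per_hour * 1_000_000

# Hypothetical example: a $2/hr GPU sustaining 1000 tok/s across a batch of requests
print(round(cost_per_million_tokens(2.0, 1000), 2))  # ~0.56 dollars per 1M tokens
```

The same formula makes API pricing comparable to self-hosting: batching raises tokens_per_second and therefore drives the per-token cost down, which is why large-batch throughput matters as much as raw GPU price.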


Taught by

Trelis Research

Related Courses

NVIDIA NIM - Deploy Accelerated AI in 5 Minutes
All About AI via YouTube
NVIDIA NIM: Deploying and Integrating Generative AI Models in Applications
Mervin Praison via YouTube
Development and Deployment of Generative AI with NVIDIA
Databricks via YouTube
Building Multimodal AI RAG with LlamaIndex, NVIDIA NIM, and Milvus - LLM App Development
Nvidia via YouTube
Building LLM Assistants with LlamaIndex, NVIDIA NIM, and Milvus - LLM App Development
Nvidia via YouTube