How to Pick a GPU and Inference Engine for Large Language Models
Offered By: Trelis Research via YouTube
Course Description
Syllabus
How to pick a GPU and software for inference
Video Overview
Effect of Quantization on Quality
Effect of Quantization on Speed
Effect of GPU bandwidth relative to model size
Effect of de-quantization on inference speed
Marlin Kernels, AWQ and Neural Magic
Inference Software - vLLM, TGI, SGLang, NIM
Deploying one-click templates for inference
Testing inference speed for a batch size of 1 and 64
SGLang inference speed
vLLM inference speed
Text Generation Inference (TGI) inference speed
Nvidia NIM Inference Speed
Comparing vLLM, SGLang, TGI and NIM Inference Speed
Comparing inference costs for A40, A6000, A100 and H100
Inference Setup for Llama 3.1 70B and 405B
Running inference on Llama 8B on A40, A6000, A100 and H100
Inference cost comparison for Llama 8B
Running inference on Llama 70B and 405B on A40, A6000, A100 and H100
Inference cost comparison for Llama 70B and 405B
OpenAI GPT-4o Inference Costs versus Llama 3.1 8B, 70B, 405B
Final Inference Tips
Resources
Taught by
Trelis Research
Related Courses
Finetuning, Serving, and Evaluating Large Language Models in the Wild (Open Data Science via YouTube)
Cloud Native Sustainable LLM Inference in Action (CNCF [Cloud Native Computing Foundation] via YouTube)
Optimizing Kubernetes Cluster Scaling for Advanced Generative Models (Linux Foundation via YouTube)
LLaMa for Developers (LinkedIn Learning)
Scaling Video Ad Classification Across Millions of Classes with GenAI (Databricks via YouTube)