How to Pick a GPU and Inference Engine for Large Language Models
Offered By: Trelis Research via YouTube
Course Description
Overview
Syllabus
How to pick a GPU and software for inference
Video Overview
Effect of Quantization on Quality
Effect of Quantization on Speed
Effect of GPU bandwidth relative to model size
Effect of de-quantization on inference speed
Marlin Kernels, AWQ and Neural Magic
Inference Software - vLLM, TGI, SGLang, NIM
Deploying one-click templates for inference
Testing inference speed for a batch size of 1 and 64
SGLang inference speed
vLLM inference speed
Text Generation Inference Speed
NVIDIA NIM Inference Speed
Comparing vLLM, SGLang, TGI and NIM Inference Speed
Comparing inference costs for A40, A6000, A100 and H100
Inference Setup for Llama 3.1 70B and 405B
Running inference on Llama 8B on A40, A6000, A100 and H100
Inference cost comparison for Llama 8B
Running inference on Llama 70B and 405B on A40, A6000, A100 and H100
Inference cost comparison for Llama 70B and 405B
OpenAI GPT4o Inference Costs versus Llama 3.1 8B, 70B, 405B
Final Inference Tips
Resources
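The syllabus item on GPU bandwidth relative to model size reflects a standard rule of thumb: at batch size 1, decoding is memory-bandwidth-bound, since every generated token requires streaming all model weights from GPU memory once. A minimal sketch of that upper bound, using published spec-sheet bandwidth figures (real-world throughput is lower due to kernel overheads and KV-cache reads; the function name and GPU list here are illustrative, not from the video):

```python
# Spec-sheet memory bandwidth in GB/s (approximate published figures;
# exact values vary by GPU variant).
GPUS_GB_PER_S = {
    "A40": 696,
    "A6000": 768,
    "A100 (80GB SXM)": 2039,
    "H100 (SXM)": 3350,
}

def max_tokens_per_sec(params_billion: float, bytes_per_param: float,
                       bandwidth_gb_s: float) -> float:
    """Upper bound on decode tokens/sec for a bandwidth-bound model:
    bandwidth divided by the bytes read per token (the whole model)."""
    model_gb = params_billion * bytes_per_param
    return bandwidth_gb_s / model_gb

for gpu, bw in GPUS_GB_PER_S.items():
    fp16 = max_tokens_per_sec(8, 2.0, bw)   # Llama 3.1 8B in FP16 (2 bytes/param)
    int4 = max_tokens_per_sec(8, 0.5, bw)   # ~4-bit quantized (0.5 bytes/param)
    print(f"{gpu}: ~{fp16:.0f} tok/s (FP16), ~{int4:.0f} tok/s (4-bit) ceiling")
```

This also shows why quantization speeds up batch-size-1 inference (fewer bytes per parameter to read) even though de-quantization adds compute, a trade-off the video covers.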
Taught by
Trelis Research
Related Courses
Digital Signal Processing - École Polytechnique Fédérale de Lausanne via Coursera
Principles of Communication Systems - I - Indian Institute of Technology Kanpur via Swayam
Digital Signal Processing 2: Filtering - École Polytechnique Fédérale de Lausanne via Coursera
Digital Signal Processing 3: Analog vs Digital - École Polytechnique Fédérale de Lausanne via Coursera
Digital Signal Processing 4: Applications - École Polytechnique Fédérale de Lausanne via Coursera