YoVDO

Deploy LLM to Production on Single GPU - REST API for Falcon 7B with QLoRA on Inference Endpoints

Offered By: Venelin Valkov via YouTube

Tags

- LLM (Large Language Model) Courses
- REST APIs Courses
- Machine Learning Model Deployment Courses
- QLoRA Courses

Course Description

Overview

Learn how to deploy a fine-tuned Falcon 7B language model with QLoRA to production using HuggingFace Inference Endpoints. Follow along as the tutorial merges the QLoRA adapter with the base model, pushes the merged model to the HuggingFace Hub, and exposes it through a REST API. Gain insights into writing a custom handler for the inference endpoint and testing the deployed model. The tutorial covers everything from the initial setup in Google Colab to the final REST API tests, providing a practical guide for deploying large language models in production environments.

Syllabus

- Introduction
- Text Tutorial on MLExpert.io
- Google Colab Setup
- Merge QLoRA adapter with Falcon 7B
- Push Model to HuggingFace Hub
- Inference with the Merged Model
- HuggingFace Inference Endpoints with Custom Handler
- Create Endpoint for the Deployment
- Test the REST API
- Conclusion
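The "Custom Handler" step in the syllabus can be sketched as a `handler.py` placed in the model repository: Inference Endpoints look for a class named `EndpointHandler` with this shape. The generation parameters and payload keys below are illustrative assumptions, not the course's exact settings.

```python
# handler.py — minimal custom handler sketch for HuggingFace
# Inference Endpoints (illustrative; settings are assumptions).
from transformers import AutoModelForCausalLM, AutoTokenizer


class EndpointHandler:
    def __init__(self, path: str = ""):
        # `path` is the local checkout of the model repository
        # that the endpoint downloads at startup.
        self.tokenizer = AutoTokenizer.from_pretrained(path)
        self.model = AutoModelForCausalLM.from_pretrained(
            path, trust_remote_code=True, device_map="auto"
        )

    def __call__(self, data: dict) -> list:
        # Expected payload: {"inputs": "<prompt>", "parameters": {...}}
        prompt = data["inputs"]
        params = data.get("parameters", {})
        input_ids = self.tokenizer(
            prompt, return_tensors="pt"
        ).input_ids.to(self.model.device)
        output = self.model.generate(
            input_ids,
            max_new_tokens=params.get("max_new_tokens", 128),
        )
        text = self.tokenizer.decode(output[0], skip_special_tokens=True)
        return [{"generated_text": text}]
```

Once the endpoint is live, the "Test the REST API" step amounts to a POST request against the endpoint URL with a JSON body like `{"inputs": "..."}` and an `Authorization: Bearer <HF token>` header; the handler above returns the generated text in the response body.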


Taught by

Venelin Valkov

Related Courses

Fine-Tuning LLM with QLoRA on Single GPU - Training Falcon-7b on ChatBot Support FAQ Dataset
Venelin Valkov via YouTube
Building an LLM Fine-Tuning Dataset - From Reddit Comments to QLoRA Training
sentdex via YouTube
Generative AI: Fine-Tuning LLM Models Crash Course
Krish Naik via YouTube
Aligning Open Language Models - Stanford CS25 Lecture
Stanford University via YouTube
Fine-Tuning LLM Models - Generative AI Course
freeCodeCamp