Deploy LLM to Production on a Single GPU - REST API for Falcon 7B with QLoRA on Inference Endpoints
Offered By: Venelin Valkov via YouTube
Course Description
Overview
Learn how to deploy a fine-tuned Falcon 7B language model with QLoRA to production using HuggingFace Inference Endpoints. Follow along as the video demonstrates merging the QLoRA adapter with the base model, pushing the merged model to the HuggingFace Hub, and setting up a REST API. Gain insight into creating custom handlers for inference endpoints and testing the deployed model; an illustrative merge sketch follows this overview, and a handler sketch follows the syllabus. This tutorial covers everything from the initial Google Colab setup to the final step of testing the REST API, offering a practical guide to deploying large language models in production environments.
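As a preview of the merge-and-push step, here is a minimal sketch using the PEFT and Transformers libraries. The adapter and destination repo names are hypothetical placeholders; the exact code shown in the video may differ.

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_MODEL = "tiiuae/falcon-7b"
ADAPTER_REPO = "your-username/falcon-7b-qlora-adapter"  # hypothetical adapter repo
MERGED_REPO = "your-username/falcon-7b-merged"          # hypothetical destination repo

# Reload the base model in half precision; the 4-bit weights used during
# QLoRA training cannot be merged directly, so full weights are needed here.
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    device_map="auto",
)

# Attach the trained LoRA adapter, then fold its weights into the base model.
model = PeftModel.from_pretrained(base_model, ADAPTER_REPO)
model = model.merge_and_unload()

# Push the standalone merged model and its tokenizer to the HuggingFace Hub.
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model.push_to_hub(MERGED_REPO)
tokenizer.push_to_hub(MERGED_REPO)
```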
Syllabus
- Introduction
- Text Tutorial on MLExpert.io
- Google Colab Setup
- Merge QLoRA adapter with Falcon 7B
- Push Model to HuggingFace Hub
- Inference with the Merged Model
- HuggingFace Inference Endpoints with Custom Handler
- Create Endpoint for the Deployment
- Test the REST API
- Conclusion
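To give a sense of the custom handler step, below is a minimal sketch following the convention HuggingFace Inference Endpoints document: a handler.py at the root of the model repository exposing an EndpointHandler class. The loading details and generation parameters are assumptions, not the exact handler built in the video.

```python
# handler.py -- placed at the root of the model repository on the Hub.
from typing import Any, Dict, List

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class EndpointHandler:
    def __init__(self, path: str = ""):
        # `path` is the local checkout of the model repo on the endpoint.
        self.tokenizer = AutoTokenizer.from_pretrained(path)
        self.model = AutoModelForCausalLM.from_pretrained(
            path,
            torch_dtype=torch.float16,
            device_map="auto",
            trust_remote_code=True,
        )

    def __call__(self, data: Dict[str, Any]) -> List[Dict[str, str]]:
        # Requests arrive as {"inputs": "...", "parameters": {...}}.
        prompt = data["inputs"]
        params = data.get("parameters", {})
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        with torch.inference_mode():
            output = self.model.generate(
                **inputs,
                max_new_tokens=params.get("max_new_tokens", 128),
            )
        text = self.tokenizer.decode(output[0], skip_special_tokens=True)
        return [{"generated_text": text}]
```

Once the endpoint is deployed, the REST API can be exercised with a plain HTTP request; the URL and token below are placeholders for the values shown on the endpoint's overview page:

```python
import requests

ENDPOINT_URL = "https://your-endpoint.endpoints.huggingface.cloud"  # placeholder URL
HEADERS = {
    "Authorization": "Bearer hf_your_token",  # placeholder access token
    "Content-Type": "application/json",
}

response = requests.post(
    ENDPOINT_URL,
    headers=HEADERS,
    json={
        "inputs": "Explain QLoRA in one sentence.",
        "parameters": {"max_new_tokens": 64},
    },
)
print(response.json())
```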
Taught by
Venelin Valkov
Related Courses
- Google BARD and ChatGPT AI for Increased Productivity (Udemy)
- Bringing LLM to the Enterprise - Training From Scratch or Just Fine-Tune With Cerebras-GPT (Prodramp via YouTube)
- Generative AI and Long-Term Memory for LLMs (James Briggs via YouTube)
- Extractive Q&A With Haystack and FastAPI in Python (James Briggs via YouTube)
- OpenAssistant First Models Are Here! - Open-Source ChatGPT (Yannic Kilcher via YouTube)