How Fast Can Your Model Composition Run in Serverless Inference?

Offered By: CNCF [Cloud Native Computing Foundation] via YouTube

Tags

Machine Learning Courses, Kubernetes Courses, LLM (Large Language Model) Courses, Embeddings Courses, Retrieval Augmented Generation (RAG) Courses

Course Description

Overview

Explore the challenges and solutions for efficient multi-model composition and inference in serverless Kubernetes environments in this conference talk. Learn how integrating BentoML with Dragonfly addresses slow deployment times, high operational costs, and scalability issues when serving interconnected suites of ML models. Discover a case study of a RAG application combining LLM, embedding, and OCR models, showcasing efficient packaging and swift distribution through Dragonfly's P2P network. Delve into the use of open-source technologies like JuiceFS and vLLM to achieve deployment times of just 40 seconds and establish a scalable blueprint for multi-model composition deployments. Gain insights into transforming the landscape of AI model serving and overcoming the complexities of typical AI applications that require multiple interconnected models.
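
For illustration, here is a minimal sketch of the multi-model composition pattern the talk describes, written against BentoML's 1.2+ service API. The class names, resource settings, and placeholder bodies are hypothetical, not code from the talk; a real deployment would load actual embedding and vLLM-backed models.

```python
import bentoml


# Hypothetical embedding service; a real deployment would load a
# sentence-transformer or similar model in the constructor.
@bentoml.service(resources={"gpu": 1})
class Embedder:
    @bentoml.api
    def embed(self, text: str) -> list[float]:
        # Placeholder: return a fixed-size vector instead of real embeddings.
        return [0.0] * 384


# Hypothetical LLM service; the talk pairs this with vLLM for serving.
@bentoml.service(resources={"gpu": 1})
class LLM:
    @bentoml.api
    def generate(self, prompt: str) -> str:
        # Placeholder: a real service would run inference here.
        return f"answer to: {prompt}"


# Composition: the RAG service declares the other services as dependencies,
# so each model is packaged and scaled (and cold-started) independently,
# which is what makes fast image/model distribution matter so much.
@bentoml.service
class RAG:
    embedder = bentoml.depends(Embedder)
    llm = bentoml.depends(LLM)

    @bentoml.api
    def query(self, question: str) -> str:
        vec = self.embedder.embed(text=question)
        # A retrieval step against a vector store would use `vec` here
        # to fetch context before prompting the LLM.
        return self.llm.generate(prompt=question)
```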

Syllabus

How Fast Can Your Model Composition Run in Serverless Inference? - Fog Dong, BentoML & Wenbo Qi


Taught by

CNCF [Cloud Native Computing Foundation]

Related Courses

Better Llama with Retrieval Augmented Generation - RAG
James Briggs via YouTube
Live Code Review - Pinecone Vercel Starter Template and Retrieval Augmented Generation
Pinecone via YouTube
Nvidia's NeMo Guardrails - Full Walkthrough for Chatbots - AI
James Briggs via YouTube
Hugging Face LLMs with SageMaker - RAG with Pinecone
James Briggs via YouTube
Supercharge Your LLM Applications with RAG
Data Science Dojo via YouTube