Fine-tuning Multi-modal LLaVA Vision and Language Models

Offered By: Trelis Research via YouTube

Tags

Computer Vision Courses
LoRA (Low-Rank Adaptation) Courses
Fine-Tuning Courses
LLaVA Courses

Course Description

Overview

Learn how to fine-tune multi-modal vision and language models like LLaVA in this comprehensive tutorial. Explore the architectures of LLaVA 1.5, LLaVA 1.6, and IDEFICS, and see how their applications compare to ChatGPT. Dive into vision encoder design and overall multi-modal model architecture. Work through data creation, dataset preparation, and fine-tuning techniques, gaining hands-on experience with data loading, LoRA setup, and evaluation methods. Follow along with practical demonstrations of training, inference, and post-training evaluation, and finish with technical clarifications and a summary of key takeaways for working with advanced vision and language models.
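As context for the LoRA setup and fine-tuning steps mentioned above, here is a minimal sketch of what a LoRA configuration for LLaVA 1.5 could look like using the Hugging Face transformers and peft libraries. The checkpoint name, rank, and target-module pattern are illustrative assumptions, not the course's exact code.

```python
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Assumed checkpoint; the course may use a different base model.
model_id = "llava-hf/llava-1.5-7b-hf"

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

lora_config = LoraConfig(
    r=16,  # adapter rank (assumed value)
    lora_alpha=32,
    lora_dropout=0.05,
    # Regex so adapters attach only to the language model's attention
    # projections, leaving the CLIP vision tower frozen.
    target_modules=r"language_model.*\.(q_proj|k_proj|v_proj|o_proj)",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices train
```

The resulting model can then be passed to a standard training loop or a transformers Trainer, with image-text pairs collated through the processor.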

Syllabus

Fine-tuning Multi-modal Models
Overview
LLaVA vs ChatGPT
Applications
Multi-modal model architecture
Vision Encoder architecture
LLaVA 1.5 architecture
LLaVA 1.6 architecture
IDEFICS architecture
Data creation
Dataset creation
Fine-tuning
Inference and Evaluation
Data loading
LoRA setup
Recap so far
Training
Evaluation post-training
Technical clarifications
Summary
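To give a flavor of the "Inference and Evaluation" step above, a minimal inference sketch against a LLaVA 1.5 checkpoint might look like the following; the checkpoint, sample image URL, and prompt template are assumptions for illustration rather than the course's own code.

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Any RGB image works; this URL is just an example.
url = "https://llava-vl.github.io/static/images/view.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# LLaVA 1.5's chat format places an <image> placeholder in the user turn.
prompt = "USER: <image>\nDescribe this image in one sentence.\nASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)
output_ids = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```

Post-training evaluation in the same spirit would run prompts like this over a held-out set and compare outputs before and after applying the LoRA adapters.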


Taught by

Trelis Research

Related Courses

LLaVA: The New Open Access Multimodal AI Model
1littlecoder via YouTube
Autogen and Local LLMs Create Realistic Stable Diffusion Model Autonomously
kasukanra via YouTube
Image Annotation with LLaVA and Ollama
Sam Witteveen via YouTube
Unraveling Multimodality with Large Language Models
Linux Foundation via YouTube
Efficient and Portable AI/LLM Inference on the Edge Cloud - Workshop
Linux Foundation via YouTube