RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

Offered By: Montreal Robotics via YouTube

Tags

Robotics Courses, Machine Learning Courses, Computer Vision Courses, Transfer Learning Courses, Generalization Courses, Vision-Language Models Courses

Course Description

Overview

Explore the groundbreaking research on incorporating vision-language models trained on Internet-scale data into end-to-end robotic control. Delve into the study of how this integration enhances generalization and enables emergent semantic reasoning in robotics. Learn about the novel approach of co-fine-tuning state-of-the-art vision-language models on both robotic trajectory data and Internet-scale vision-language tasks. Discover the innovative technique of expressing robotic actions as text tokens, allowing for seamless integration with natural language responses. Examine the concept of vision-language-action (VLA) models and the specific implementation known as RT-2. Analyze the extensive evaluation results, showcasing improved generalization to novel objects, interpretation of complex commands, and rudimentary reasoning abilities. Explore the potential of chain-of-thought reasoning in enabling multi-stage semantic reasoning for robotic tasks. Gain insights into the future possibilities of robotic control enhanced by large-scale pretraining on language and vision-language data from the web.
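The description above mentions expressing robotic actions as text tokens so they can sit alongside natural language outputs. A minimal sketch of that idea is given below, assuming actions are normalized to [-1, 1] and each dimension is discretized into 256 integer bins; the names (NUM_BINS, action_to_tokens, the 7-dimensional example action) are illustrative assumptions, not code from the talk or paper.

```python
import numpy as np

# Sketch of the "actions as text tokens" idea: each continuous action
# dimension is discretized into a fixed number of bins, and the resulting
# integers are written out as a short text string that a language model
# can emit like any other tokens.

NUM_BINS = 256                        # assumed bins per action dimension
ACTION_LOW, ACTION_HIGH = -1.0, 1.0   # assumed normalized action range

def action_to_tokens(action: np.ndarray) -> str:
    """Map a continuous action vector to a space-separated token string."""
    clipped = np.clip(action, ACTION_LOW, ACTION_HIGH)
    # Scale each dimension to [0, NUM_BINS - 1] and round to an integer bin.
    bins = np.round(
        (clipped - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW) * (NUM_BINS - 1)
    ).astype(int)
    return " ".join(str(b) for b in bins)

def tokens_to_action(token_str: str) -> np.ndarray:
    """Invert the mapping: recover an approximate continuous action."""
    bins = np.array([int(t) for t in token_str.split()], dtype=float)
    return bins / (NUM_BINS - 1) * (ACTION_HIGH - ACTION_LOW) + ACTION_LOW

# Example: a hypothetical 7-DoF action (end-effector deltas plus gripper).
action = np.array([0.1, -0.25, 0.0, 0.5, -0.5, 0.9, 1.0])
tokens = action_to_tokens(action)
print(tokens)                   # a short string of integer bin indices
print(tokens_to_action(tokens)) # approximate reconstruction of the action
```

The point of the sketch is only the round trip: once actions are strings of integer tokens, the same model that answers vision-language questions can also decode robot commands, which is the integration the talk describes.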

Syllabus

Yevgen Chebotar: RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control


Taught by

Montreal Robotics

Related Courses

Mastering Google's PaliGemma VLM: Tips and Tricks for Success and Fine-Tuning
Sam Witteveen via YouTube
Fine-tuning PaliGemma for Custom Object Detection
Roboflow via YouTube
Florence-2: The Best Small Vision Language Model - Capabilities and Demo
Sam Witteveen via YouTube
Fine-tuning Florence-2: Microsoft's Multimodal Model for Custom Object Detection
Roboflow via YouTube
OpenVLA: An Open-Source Vision-Language-Action Model - Research Presentation
HuggingFace via YouTube