Vision Transformer and Its Applications
Offered By: Open Data Science via YouTube
Course Description
Overview
Syllabus
Intro
Vision Transformer (ViT) and Its Applications
Why It Matters
Human Visual Attention
Attention is Dot Product between 2 Features
In Natural Language Processing
Image to Patches
Linear Projection - Patches to Features
Vision Transformer is Invariant to Position of Patches
Position Embedding
Learnable Class Embedding
Why Layer Norm?
Why Skip Connection?
Why Multi-Head Self-Attention?
A Transformer Encoder is Made of L Encoder Modules Stacked Together
Versions Based on Layers, MLP Size, MSA Heads
Pre-training on a Large Dataset, Fine-tuning on the Target Dataset
Training by Knowledge Distillation (DeiT)
Semantic Segmentation (mIoU: 50.3 SETR vs. baseline PSPNet on ADE20K)
Semantic Segmentation (mIoU: 84.4 SegFormer vs. 82.2 SETR on Cityscapes)
Vision Transformer for STR (ViTSTR)
Parameter, FLOPS, Speed Efficient
Medical Image Segmentation (DSC: 77.5 TransUNet vs. 71.3 R50-ViT baseline)
Limitations
Recommended Open-Source Implementations of ViT
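The core pipeline the syllabus walks through (image to patches, linear projection to features, self-attention as a dot product between features) can be sketched in a few lines of numpy. This is an illustrative toy, not any of the recommended implementations; the patch size, embedding dimension, and weight initialization below are arbitrary choices for the example.

```python
import numpy as np

def image_to_patches(img, patch):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    H, W, C = img.shape
    rows, cols = H // patch, W // patch
    grid = img[:rows * patch, :cols * patch].reshape(
        rows, patch, cols, patch, C).transpose(0, 2, 1, 3, 4)
    return grid.reshape(rows * cols, patch * patch * C)

def self_attention(x):
    """Scaled dot-product self-attention: softmax(x x^T / sqrt(d)) x."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                      # dot product between features
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                  # row-wise softmax
    return w @ x

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32, 3))                  # toy 32x32 RGB image
tokens = image_to_patches(img, patch=8)                 # (16, 192): 16 patch tokens
W_proj = rng.standard_normal((192, 64)) * 0.02          # linear projection to d=64
x = tokens @ W_proj                                     # patch embeddings
out = self_attention(x)                                 # attention over all patches
print(tokens.shape, x.shape, out.shape)
```

Because the attention weights depend only on pairwise dot products between patch embeddings, shuffling the patch order permutes the output identically, which is why ViT adds position embeddings to break that invariance.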
Taught by
Open Data Science
Related Courses
Introduction to Artificial Intelligence - Stanford University via Udacity
Computer Vision: The Fundamentals - University of California, Berkeley via Coursera
Computational Photography - Georgia Institute of Technology via Coursera
Introduction to Computer Vision (Einführung in Computer Vision) - Technische Universität München (Technical University of Munich) via Coursera
Introduction to Computer Vision - Georgia Institute of Technology via Udacity