Vision Transformer and Its Applications
Offered By: Open Data Science via YouTube
Course Description
Overview
Syllabus
Intro
Vision Transformer (ViT) and Its Applications
Why It Matters
Human Visual Attention
Attention is a Dot Product between Two Features (sketched in code after the syllabus)
In Natural Language Processing
Image to Patches
Linear Projection - Patches to Features (see the patch-embedding sketch after the syllabus)
Vision Transformer is Invariant to Position of Patches
Position Embedding
Learnable Class Embedding
Why Layer Norm?
Why Skip Connection?
Why Multi-Head Self-Attention?
A Transformer Encoder is Made of L Encoder Modules Stacked Together (see the encoder-block sketch after the syllabus)
Versions Based on Layers, MLP Size, MSA Heads
Pre-training on a Large Dataset, Fine-tuning on the Target Dataset
Training by Knowledge Distillation (DeiT; see the distillation sketch after the syllabus)
Semantic Segmentation (mIoU: 50.3 SETR vs baseline PSPNet on ADE20K)
Semantic Segmentation (mIoU: 84.4 SegFormer vs 82.2 SETR on Cityscapes)
Vision Transformer for Scene Text Recognition (ViTSTR)
Parameter, FLOPS, and Speed Efficient
Medical Image Segmentation (DSC: 77.5 TransUNet vs 71.3 R50-ViT baseline)
Limitations
Recommended Open-Source Implementations of ViT
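Code Sketches
The "attention is a dot product" idea from the syllabus, as a minimal PyTorch sketch. The function name, the query/key/value naming, and the 1/sqrt(d) scaling follow the standard transformer formulation rather than the talk's slides, so treat them as illustrative assumptions.

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # Similarity between two features is their dot product; softmax turns
    # the similarity scores into attention weights over the value vectors.
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    weights = F.softmax(scores, dim=-1)
    return weights @ v

x = torch.randn(1, 197, 768)                 # 196 patch tokens + 1 class token
out = scaled_dot_product_attention(x, x, x)  # self-attention: q = k = v = x
print(out.shape)                             # torch.Size([1, 197, 768])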
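How "Image to Patches", the linear projection, the learnable class embedding, and the position embedding fit together, as a minimal sketch assuming standard ViT-Base hyperparameters (224x224 input, 16x16 patches, 768-dim features); the class and argument names are illustrative, not from the talk.

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    # Split the image into non-overlapping patches, linearly project each
    # patch to a feature vector, prepend a learnable class token, and add
    # learnable position embeddings (without them, ViT is invariant to the
    # position of patches).
    def __init__(self, img_size=224, patch_size=16, in_ch=3, dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # A conv with stride = kernel = patch_size is equivalent to
        # "patchify, then apply a shared linear projection".
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, x):                             # x: (B, 3, 224, 224)
        x = self.proj(x).flatten(2).transpose(1, 2)   # (B, 196, 768)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)                # (B, 197, 768)
        return x + self.pos_embed

print(PatchEmbedding()(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 197, 768])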
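Where layer norm, skip connections, and multi-head self-attention sit inside one encoder module, and how L modules stack, as a pre-norm sketch; the ViT-Base numbers (12 layers, 12 heads, MLP size 3072) are assumptions taken from the original ViT paper, not from the slides.

import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    # Pre-norm encoder module: layer norm stabilizes activations, and the
    # two skip connections keep gradients flowing through deep stacks.
    def __init__(self, dim=768, heads=12, mlp_dim=3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim)
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # skip around MSA
        return x + self.mlp(self.norm2(x))                 # skip around MLP

encoder = nn.Sequential(*[EncoderBlock() for _ in range(12)])  # L = 12 modules
print(encoder(torch.randn(2, 197, 768)).shape)                 # torch.Size([2, 197, 768])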
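The DeiT-style "hard" distillation objective in a few lines: alongside the usual cross-entropy against ground-truth labels, the student is also trained to match the teacher's hard predictions. The equal 1/2 weighting follows the DeiT paper; the function and tensor names are illustrative.

import torch
import torch.nn.functional as F

def hard_distillation_loss(student_logits, distill_logits, teacher_logits, labels):
    # The class-token head learns from ground-truth labels; the distillation
    # head learns from the teacher's argmax ("hard") predictions.
    teacher_labels = teacher_logits.argmax(dim=-1)
    return 0.5 * F.cross_entropy(student_logits, labels) \
         + 0.5 * F.cross_entropy(distill_logits, teacher_labels)

labels = torch.randint(0, 1000, (8,))
loss = hard_distillation_loss(torch.randn(8, 1000), torch.randn(8, 1000),
                              torch.randn(8, 1000), labels)
print(loss.item())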
Taught by
Open Data Science
Related Courses
Transformers: Text Classification for NLP Using BERT (LinkedIn Learning)
TensorFlow: Working with NLP (LinkedIn Learning)
TransGAN - Two Transformers Can Make One Strong GAN - Machine Learning Research Paper Explained (Yannic Kilcher via YouTube)
Nyströmformer - A Nyström-Based Algorithm for Approximating Self-Attention (Yannic Kilcher via YouTube)
Recreate Google Translate - Model Training (Edan Meyer via YouTube)