Stanford Seminar - Mixture of Experts Paradigm and the Switch Transformer
Offered By: Stanford University via YouTube
Course Description
Overview
Explore the groundbreaking Mixture of Experts (MoE) paradigm and the Switch Transformer in this Stanford seminar. Delve into how MoE departs from traditional deep learning models, which apply the same parameters to every input, by selecting different parameters for each input, resulting in sparsely-activated models with vast numbers of parameters but a constant computational cost per token. Learn about the simplification of MoE routing algorithms, improved model designs with reduced communication and computational costs, and training techniques that address instabilities. Discover how large sparse models can be trained in lower-precision formats, leading to significant increases in pre-training speed. Examine the application of these improvements in multilingual settings and the scaling of language models to trillion-parameter sizes. Gain insights from research scientists Barret Zoph and Irwan Bello as they discuss their work on deep learning topics including neural architecture search, data augmentation, semi-supervised learning, and model sparsity.
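The key mechanism described above is top-1 ("switch") routing: a learned router sends each token to exactly one expert, so per-token compute stays fixed while total parameters grow with the number of experts. Below is a minimal NumPy sketch of that idea, not code from the seminar or the Switch Transformer implementation; the names switch_route, router_weights, and expert_weights are illustrative assumptions.

```
import numpy as np

def switch_route(tokens, router_weights, expert_weights):
    # Sketch of top-1 ("switch") routing: each token is processed by a single
    # expert, chosen by a softmax router, and scaled by its routing probability.
    logits = tokens @ router_weights                      # [num_tokens, num_experts]
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)

    expert_index = probs.argmax(axis=-1)                  # top-1 expert per token
    gate = probs[np.arange(len(tokens)), expert_index]    # its routing probability

    outputs = np.empty_like(tokens)
    for e in range(len(expert_weights)):
        mask = expert_index == e
        if mask.any():
            # Only the selected expert's parameters touch these tokens.
            outputs[mask] = gate[mask][:, None] * (tokens[mask] @ expert_weights[e])
    return outputs

# Toy usage: 8 tokens of width 16 routed among 4 experts.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 16))
router_weights = rng.normal(size=(16, 4))
expert_weights = rng.normal(size=(4, 16, 16))
print(switch_route(tokens, router_weights, expert_weights).shape)  # (8, 16)
```

Because each token multiplies against only one expert's weight matrix, adding experts increases parameter count without increasing the FLOPs spent on any individual token.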
Syllabus
CS25 | Stanford Seminar 2022 - Mixture of Experts (MoE) Paradigm and the Switch Transformer
Taught by
Stanford Online
Related Courses
GShard - Scaling Giant Models with Conditional Computation and Automatic Sharding (Yannic Kilcher via YouTube)
Learning Mixtures of Linear Regressions in Subexponential Time via Fourier Moments (Association for Computing Machinery (ACM) via YouTube)
Modules and Architectures (Alfredo Canziani via YouTube)
Decoding Mistral AI's Large Language Models - Building Blocks and Training Strategies (Databricks via YouTube)
Pioneering a Hybrid SSM Transformer Architecture - Jamba Foundation Model (Databricks via YouTube)