GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

Offered By: Yannic Kilcher via YouTube

Tags

Neural Networks Courses, Machine Learning Courses, Routing Algorithms Courses, Backpropagation Courses, Mixture-of-Experts Courses

Course Description

Overview

Dive into an in-depth explanation of Google's 600-billion-parameter transformer model for massively multilingual machine translation. Explore how the model scales up its feed-forward layers into a sparse Mixture-of-Experts and uses hard routing to spread computation in parallel across 2048 TPUs. Learn about the Mixture-of-Experts architecture, its routing algorithm, and how this approach differs from scaling classic transformers. Examine GShard, a module that simplifies expressing parallel computation, and its use for automatically sharding the model across devices. Discover the intricacies of massively multilingual translation and analyze the results achieved by this giant model. Gain insights into the future of large-scale language models and their potential impact on machine translation technology.
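
To make the routing idea concrete, here is a minimal sketch of top-2 Mixture-of-Experts gating in the spirit of what the video describes. This is illustrative NumPy, not the GShard implementation: all names and shapes are assumptions, and real MoE layers add expert-capacity limits and an auxiliary load-balancing loss that are omitted here.

```python
import numpy as np

# Illustrative top-2 MoE routing sketch (not the GShard API; shapes and names
# are assumptions chosen for readability).
rng = np.random.default_rng(0)
num_tokens, d_model, num_experts, d_ff = 8, 16, 4, 32

tokens = rng.standard_normal((num_tokens, d_model))       # token representations
w_gate = rng.standard_normal((d_model, num_experts))      # gating network weights
w_in = rng.standard_normal((num_experts, d_model, d_ff))  # per-expert feed-forward weights
w_out = rng.standard_normal((num_experts, d_ff, d_model))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

gates = softmax(tokens @ w_gate)            # (num_tokens, num_experts) gate probabilities
top2 = np.argsort(gates, axis=-1)[:, -2:]   # indices of the two highest-scoring experts per token

output = np.zeros_like(tokens)
for t in range(num_tokens):
    for e in top2[t]:
        hidden = np.maximum(tokens[t] @ w_in[e], 0.0)   # expert feed-forward with ReLU
        output[t] += gates[t, e] * (hidden @ w_out[e])  # re-weight by the gate value
```

Because each token is sent to only two experts, compute per token stays roughly constant while the parameter count grows with the number of experts; GShard's contribution is to let the per-expert weights be annotated for sharding so this layer runs across many devices in parallel.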

Syllabus

- Intro & Overview
- Main Results
- Mixture-of-Experts
- Difference to Scaling Classic Transformers
- Backpropagation in Mixture-of-Experts
- MoE Routing Algorithm in GShard
- GShard Einsum Examples
- Massively Multilingual Translation
- Results
- Conclusion & Comments


Taught by

Yannic Kilcher

Related Courses

- Learning Mixtures of Linear Regressions in Subexponential Time via Fourier Moments (Association for Computing Machinery (ACM) via YouTube)
- Modules and Architectures (Alfredo Canziani via YouTube)
- Stanford Seminar - Mixture of Experts Paradigm and the Switch Transformer (Stanford University via YouTube)
- Decoding Mistral AI's Large Language Models - Building Blocks and Training Strategies (Databricks via YouTube)
- Pioneering a Hybrid SSM Transformer Architecture - Jamba Foundation Model (Databricks via YouTube)