A Chiplet-Based Generative Inference Architecture with Block Floating Point Datatypes
Offered By: Scalable Parallel Computing Lab, SPCL @ ETH Zurich via YouTube
Course Description
Overview
Explore a comprehensive conference talk on chiplet-based generative inference architecture and block floating point datatypes for AI acceleration. Delve into modular, spatial CGRA-like architectures optimized for generative inference, and learn about deep RL-based mappers in compilers for spatial and temporal architectures. Discover weight and activation quantization techniques in block floating point formats, building upon GPTQ and SmoothQuant, and their implementation in PyTorch. Examine an extension to EL-attention for reducing KV cache size and bandwidth. Gain insights from speaker Sudeep Bhoja in this SPCL_Bcast #38 recording from ETH Zurich's Scalable Parallel Computing Lab, featuring an in-depth presentation followed by announcements and a Q&A session.
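The block floating point formats mentioned in the talk share one exponent across a block of values, so each value only needs a small integer mantissa. As a rough illustration only (this is not the speaker's implementation; the function name, block size, and mantissa width are illustrative assumptions), a round-trip through such a format can be sketched in NumPy:

```python
import numpy as np

def bfp_quantize(x, block_size=16, mantissa_bits=8):
    """Illustrative sketch: round-trip an array through a block
    floating point format in which each block of `block_size` values
    shares one exponent and each value keeps a signed
    `mantissa_bits`-bit integer mantissa."""
    x = np.asarray(x, dtype=np.float64).ravel()
    pad = (-x.size) % block_size
    blocks = np.pad(x, (0, pad)).reshape(-1, block_size)

    # Shared exponent: exponent of the largest magnitude in each block.
    max_abs = np.abs(blocks).max(axis=1, keepdims=True)
    exps = np.floor(np.log2(np.maximum(max_abs, np.finfo(float).tiny)))

    # Weight of one mantissa LSB, chosen so the block maximum fits
    # in the signed mantissa range.
    lsb = 2.0 ** (exps - (mantissa_bits - 2))
    lo, hi = -(2 ** (mantissa_bits - 1)), 2 ** (mantissa_bits - 1) - 1
    mant = np.clip(np.round(blocks / lsb), lo, hi)

    # Dequantize and trim the padding.
    return (mant * lsb).ravel()[: x.size]

w = np.linspace(-1.0, 1.0, 32)
w_q = bfp_quantize(w, block_size=16, mantissa_bits=8)
err = np.max(np.abs(w - w_q))  # bounded by half an LSB per block
```

Because only the shared exponent varies between blocks, storage and bandwidth per value approach the mantissa width, which is the property the talk exploits for inference-time weight and activation quantization.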
Syllabus
Introduction
Talk
Announcements
Q&A Session
Taught by
Scalable Parallel Computing Lab, SPCL @ ETH Zurich
Related Courses
Sequence Models - DeepLearning.AI via Coursera
Modern Natural Language Processing in Python - Udemy
Stanford Seminar - Transformers in Language: The Development of GPT Models Including GPT-3 - Stanford University via YouTube
Long Form Question Answering in Haystack - James Briggs via YouTube
Spotify's Podcast Search Explained - James Briggs via YouTube