Characterization of Large Language Model Development in the Datacenter
Offered By: USENIX via YouTube
Course Description
Overview
Explore an in-depth characterization study of Large Language Model (LLM) development in datacenters through this 17-minute conference talk from NSDI '24. Delve into the challenges and opportunities of efficiently utilizing large-scale cluster resources for LLM development, including hardware failures, parallelization strategies, and resource utilization. Examine the differences between LLMs and traditional task-specific Deep Learning workloads, and discover potential optimizations for LLM-tailored systems. Learn about innovative approaches such as fault-tolerant pretraining and decoupled scheduling for evaluation, designed to enhance fault tolerance and achieve timely performance feedback in LLM development environments.
Syllabus
NSDI '24 - Characterization of Large Language Model Development in the Datacenter
Taught by
USENIX
Related Courses
Neural Networks for Machine Learning
University of Toronto via Coursera
機器學習技法 (Machine Learning Techniques)
National Taiwan University via Coursera
Machine Learning Capstone: An Intelligent Application with Deep Learning
University of Washington via Coursera
Прикладные задачи анализа данных (Applied Problems in Data Analysis)
Moscow Institute of Physics and Technology via Coursera
Leading Ambitious Teaching and Learning
Microsoft via edX