Handling Multi-Terabyte LLM Checkpoints - MLOps Podcast #228
Offered By: MLOps.community via YouTube
Course Description
Overview
Explore the intricacies of handling multi-terabyte LLM checkpoints in this insightful podcast episode featuring Simon Karasik, Machine Learning Engineer at Nebius AI. Delve into the challenges of LLM checkpointing, including checkpoint sizes and various techniques for saving and loading massive datasets. Gain valuable insights on selecting appropriate cloud storage options for checkpointing. Learn about Simon's diverse background in machine learning, covering areas such as ads, speech, and tax. Discover key topics like zombie model garbage collection, the evolution of LLMs, and the importance of confidence in AI training. Examine the differences between Slurm and Kubernetes, storage choice lessons, and essential components for setting up LLM infrastructure. Explore Argo workflows, Kubernetes node troubleshooting, and the complexities of fine-tuning, storage, and networking in LLM development. Benefit from practical advice on starting simple before advancing to more complex setups, and understanding model-specific needs in the rapidly evolving field of large language models.
Syllabus
Simon's preferred beverage
Takeaways
Simon's tech background
Zombie models garbage collection
The road to LLMs
Trained models Simon worked on
LLM Checkpoints
Confidence in AI Training
Different Checkpoints
Checkpoint parts
Slurm vs Kubernetes
Storage choices lessons
Paramount components for setup
Argo workflows
Kubernetes node troubleshooting
Cloud virtual machines have pre-installed monitoring
Fine-tuning
Storage, networking, and complexity in network design
Start simple before advanced; consider model needs
Join us at our first in-person conference on June 25 all about AI Quality
Taught by
MLOps.community
Related Courses
A Beginner's Guide to Docker - Packt via FutureLearn
A Beginner's Guide to Kubernetes for Container Orchestration - Packt via FutureLearn
A Practical Guide to Amazon EKS - A Cloud Guru
Advanced Networking with Kubernetes on AWS - A Cloud Guru
AIOps Essentials (Autoscaling Kubernetes with Prometheus Metrics) - A Cloud Guru