Building a Flexible Data Platform for LLM Training Data
Offered By: Data Council via YouTube
Course Description
Overview
In this 37-minute conference talk, Jonathan Talmi of Cohere describes how to build a flexible data platform for Large Language Model (LLM) training data. Drawing on Cohere's experience training LLMs from scratch, the talk covers complex ingestion, preprocessing, and distillation pipelines, the crucial role of data quality, and Cohere's architecture for handling petabyte-scale datasets. It also covers the science and practical implications of data for LLMs, the anatomy of an LLM training data pipeline, and the challenges, successes, and technical details of scaling to larger dataset sizes without relying on a distributed query engine.
Syllabus
Building a Flexible Data Platform for LLM Training Data
Taught by
Data Council
Related Courses
Google Cloud Big Data and Machine Learning Fundamentals en Español
Google Cloud via Coursera
Data Analysis with Python
IBM via Coursera
Intro to TensorFlow 日本語版
Google Cloud via Coursera
TensorFlow on Google Cloud - Français
Google Cloud via Coursera
Freedom of Data with SAP Data Hub
SAP Learning