YoVDO

Building a Flexible Data Platform for LLM Training Data

Offered By: Data Council via YouTube

Tags

Data Pipelines Courses, Data Management Courses, Data Preprocessing Courses, Data Ingestion Courses

Course Description

Overview

Watch a 37-minute conference talk by Jonathan Talmi of Cohere on building a flexible data platform for Large Language Model (LLM) training data. Drawing on Cohere's experience training LLMs from scratch, the talk covers complex ingestion, preprocessing, and distillation pipelines, the crucial role of data quality, and Cohere's architecture for handling petabyte-scale datasets. Learn about the science and practical implications of data for LLMs, the anatomy of an LLM training data pipeline, and the challenges, successes, and technical aspects of scaling to larger datasets without relying on a distributed query engine.

Syllabus

Building a Flexible Data Platform for LLM Training Data Rendered 4 9 24


Taught by

Data Council

Related Courses

Deep Dive into Amazon Glacier
Amazon via Independent
Preparing for your Professional Data Engineer Journey
Google Cloud via Coursera
Building Resilient Streaming Systems on Google Cloud Platform en Français
Google Cloud via Coursera
IBM AI Enterprise Workflow
IBM via Coursera
Introduction to Designing Data Lakes on AWS
Amazon Web Services via edX