YoVDO

Building an LLM Fine-Tuning Dataset - From Reddit Comments to QLoRA Training

Offered By: sentdex via YouTube

Tags

Machine Learning Courses, BigQuery Courses, JSON Courses, Data Preprocessing Courses, Language Models Courses, Fine-Tuning Courses, QLoRA Courses

Course Description

Overview

This video tutorial shows how to build a dataset for QLoRA fine-tuning of a language model from Reddit comments. It compares the available Reddit dataset sources (torrent files, Archive.org, and BigQuery), then walks through exporting Reddit data from BigQuery, decompressing the gzip archives, and recombining them for target subreddits. It then covers how to structure the data, build training samples and save them to a database, and create customized training JSON files, and closes with QLoRA training and its results.
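
The decompress-and-recombine steps described above can be sketched roughly as follows. This is a minimal illustration, not the code used in the video: it assumes the exported archives are gzip-compressed, newline-delimited JSON and that each row carries a "subreddit" field. The function name and file layout are hypothetical.

```python
import gzip
import json
from pathlib import Path


def recombine_for_subreddits(archive_dir, out_path, target_subreddits):
    """Stream every *.gz file in archive_dir and keep rows from target subreddits.

    Assumption (not from the video): each archive is gzip-compressed,
    newline-delimited JSON, and each row has a "subreddit" field.
    Returns the number of rows written to out_path.
    """
    targets = {name.lower() for name in target_subreddits}
    kept = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for archive in sorted(Path(archive_dir).glob("*.gz")):
            # "rt" decompresses on the fly and yields text lines,
            # so no archive has to fit in memory at once.
            with gzip.open(archive, "rt", encoding="utf-8") as fh:
                for line in fh:
                    line = line.strip()
                    if not line:
                        continue
                    try:
                        row = json.loads(line)
                    except json.JSONDecodeError:
                        continue  # skip malformed lines
                    if not isinstance(row, dict):
                        continue
                    if str(row.get("subreddit", "")).lower() in targets:
                        out.write(json.dumps(row) + "\n")
                        kept += 1
    return kept
```

Streaming line by line keeps memory use flat regardless of archive size, which matters for a dataset as large as a Reddit comment export.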

Syllabus

- Introduction to dataset building for fine-tuning
- The Reddit dataset options: torrent, Archive.org, BigQuery
- Exporting Reddit data (and some other data) from BigQuery
- Decompressing all of the gzip archives
- Re-combining the archives for target subreddits
- How to structure the data
- Building training samples and saving to database
- Creating customized training JSON files
- QLoRA training and results
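
The "building training samples" and "training JSON" steps in the syllabus might look something like the sketch below. It is an assumption-laden illustration, not the course's actual code: it assumes each comment is a dict with "id", "parent_id", and "body" fields, relies on Reddit's convention that a parent_id starting with "t1_" points at another comment, and uses a hypothetical prompt/response JSON layout. The table and function names are invented for the example.

```python
import json
import sqlite3


def build_samples(comments, db_path=":memory:"):
    """Pair each reply with its parent comment and store the pairs in SQLite.

    Assumption: comments are dicts with "id", "parent_id", and "body" keys.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples ("
        "reply_id TEXT PRIMARY KEY, parent_body TEXT, reply_body TEXT)"
    )
    by_id = {c["id"]: c for c in comments}
    for c in comments:
        parent_id = c.get("parent_id") or ""
        # "t1_" marks a comment parent; other prefixes (e.g. "t3_") are posts.
        if not parent_id.startswith("t1_"):
            continue
        parent = by_id.get(parent_id[3:])
        if parent is None:
            continue  # parent comment is not in this batch
        conn.execute(
            "INSERT OR REPLACE INTO samples VALUES (?, ?, ?)",
            (c["id"], parent["body"], c["body"]),
        )
    conn.commit()
    return conn


def export_training_json(conn, out_path):
    """Write stored pairs to a JSON file in a hypothetical prompt/response layout."""
    rows = conn.execute(
        "SELECT parent_body, reply_body FROM samples ORDER BY reply_id"
    ).fetchall()
    samples = [{"prompt": p, "response": r} for p, r in rows]
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump(samples, fh, ensure_ascii=False, indent=2)
    return samples
```

Keeping the pairs in a database before exporting makes it possible to generate several differently filtered training files from the same samples without reprocessing the raw archives.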


Taught by

sentdex

Related Courses

Genomic Data Science and Clustering (Bioinformatics V)
University of California, San Diego via Coursera
用Python玩转数据 Data Processing Using Python
Nanjing University via Coursera
Data Mining Project
University of Illinois at Urbana-Champaign via Coursera
Advanced Business Analytics Capstone
University of Colorado Boulder via Coursera
Data Mining: Theories and Algorithms for Tackling Big Data | 数据挖掘:理论与算法
Tsinghua University via edX