Building an LLM Fine-Tuning Dataset - From Reddit Comments to QLoRA Training
Offered By: sentdex via YouTube
Course Description
Overview
Learn how to build a dataset for QLoRA fine-tuning of a language model in this video tutorial. Compare the available sources of Reddit comment data: torrent files, Archive.org, and BigQuery. Follow step-by-step instructions for exporting the Reddit data from BigQuery, decompressing the gzip archives, and recombining them into files for your target subreddits. Then work out how to structure the data, build training samples, and save them to a database. Finish by creating customized training JSON files, running QLoRA training, and reviewing the results.
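The decompress-and-recombine step described above can be sketched as follows. This is a minimal illustration, not the code used in the video: it assumes the export is a directory of gzip files holding one JSON comment per line, and that each comment has a `subreddit` field. The function name `filter_subreddits` and the file layout are hypothetical.

```python
import gzip
import json
from pathlib import Path


def filter_subreddits(archive_dir, out_path, targets):
    """Stream every .gz file in archive_dir and keep comments from target subreddits.

    Assumes newline-delimited JSON with a "subreddit" field (an assumption
    about the export format, not something the course listing specifies).
    """
    targets = {name.lower() for name in targets}
    kept = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for archive in sorted(Path(archive_dir).glob("*.gz")):
            # Text mode decompresses on the fly, so no archive is fully loaded.
            with gzip.open(archive, "rt", encoding="utf-8") as handle:
                for line in handle:
                    line = line.strip()
                    if not line:
                        continue
                    try:
                        comment = json.loads(line)
                    except json.JSONDecodeError:
                        continue  # skip malformed rows
                    if not isinstance(comment, dict):
                        continue
                    if str(comment.get("subreddit", "")).lower() in targets:
                        out.write(json.dumps(comment) + "\n")
                        kept += 1
    return kept
```

Calling `filter_subreddits("export/", "target.jsonl", ["python"])` would write the matching comments to a single combined file and return how many were kept. Streaming line by line keeps memory use flat, which matters when the archives are large.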
Syllabus
- Introduction to dataset building for fine-tuning
- The Reddit dataset options: torrent, Archive.org, BigQuery
- Exporting BigQuery Reddit and some other data
- Decompressing all of the gzip archives
- Re-combining the archives for target subreddits
- How to structure the data
- Building training samples and saving to database
- Creating customized training JSON files
- QLoRA training and results
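The sample-building, database, and JSON steps in the syllabus above could look roughly like this. It is a sketch under stated assumptions, not the instructor's implementation: it assumes each comment carries `id`, `parent_id`, `body`, and `score` fields, relies on the Reddit convention that a `parent_id` starting with `t1_` points at another comment, and uses a hypothetical `input`/`output` record layout and a hypothetical `min_score` filter.

```python
import json
import sqlite3


def build_samples(comments, min_score=2):
    """Pair each reply with its parent comment (hypothetical sample format)."""
    by_id = {c["id"]: c for c in comments}
    samples = []
    for reply in comments:
        parent_ref = reply.get("parent_id") or ""
        # "t1_" marks a parent that is itself a comment; "t3_" is a submission.
        if not parent_ref.startswith("t1_"):
            continue
        parent = by_id.get(parent_ref[3:])
        if parent is None or (reply.get("score") or 0) < min_score:
            continue
        samples.append((reply["id"], parent["body"], reply["body"]))
    return samples


def save_samples(conn, samples):
    """Store (reply_id, prompt, response) rows in a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples "
        "(reply_id TEXT PRIMARY KEY, prompt TEXT, response TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO samples VALUES (?, ?, ?)", samples)
    conn.commit()


def export_json(conn, out_path):
    """Write all stored samples to a JSON file and return the record count."""
    rows = conn.execute(
        "SELECT prompt, response FROM samples ORDER BY reply_id"
    ).fetchall()
    records = [{"input": prompt, "output": response} for prompt, response in rows]
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    return len(records)
```

Keeping the samples in a database between the build and export steps makes it possible to generate several differently filtered JSON files from the same stored pairs without reprocessing the raw comments.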
Taught by
sentdex
Related Courses
Genomic Data Science and Clustering (Bioinformatics V)
University of California, San Diego via Coursera
用Python玩转数据 Data Processing Using Python
Nanjing University via Coursera
Data Mining Project
University of Illinois at Urbana-Champaign via Coursera
Advanced Business Analytics Capstone
University of Colorado Boulder via Coursera
Data Mining: Theories and Algorithms for Tackling Big Data | 数据挖掘:理论与算法
Tsinghua University via edX