Building an LLM Fine-Tuning Dataset - From Reddit Comments to QLoRA Training
Offered By: sentdex via YouTube
Course Description
Overview
Learn how to build a QLoRA fine-tuning dataset for language models in this video tutorial. It compares the available sources of Reddit comment data (torrent files, Archive.org, and BigQuery), then walks through exporting the Reddit data from BigQuery, decompressing the gzip archives, and recombining them for the target subreddits. From there it covers how to structure the data, build training samples and save them to a database, and create customized training JSON files, closing with QLoRA training and a look at the results.
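As a rough illustration of the decompress-and-recombine step, here is a minimal Python sketch. It is not the code from the video. It assumes the export consists of gzip-compressed, newline-delimited JSON files whose rows carry a `subreddit` field; the function name `filter_subreddits` and the file paths are hypothetical.

```python
import gzip
import json
from pathlib import Path


def filter_subreddits(archive_dir, out_path, targets):
    """Stream every .gz archive in archive_dir and keep only rows from target subreddits.

    Assumes newline-delimited JSON with a "subreddit" field (an assumption,
    not something the course description specifies).
    Returns the number of rows written to out_path.
    """
    wanted = {name.lower() for name in targets}
    kept = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for archive in sorted(Path(archive_dir).glob("*.gz")):
            # gzip.open in text mode decompresses on the fly, line by line,
            # so the archives never need to fit in memory.
            with gzip.open(archive, "rt", encoding="utf-8") as fh:
                for line in fh:
                    line = line.strip()
                    if not line:
                        continue
                    try:
                        row = json.loads(line)
                    except json.JSONDecodeError:
                        continue  # skip malformed rows
                    if not isinstance(row, dict):
                        continue
                    if str(row.get("subreddit", "")).lower() in wanted:
                        out.write(json.dumps(row) + "\n")
                        kept += 1
    return kept
```

Calling `filter_subreddits("exports/", "combined.jsonl", ["wallstreetbets"])` would write a single combined file containing only the rows for that subreddit.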
Syllabus
- Introduction to dataset building for fine-tuning
- The Reddit dataset options: torrent, Archive.org, BigQuery
- Exporting BigQuery Reddit (and some other) data
- Decompressing all of the gzip archives
- Re-combining the archives for target subreddits
- How to structure the data
- Building training samples and saving to database
- Creating customized training JSON files
- QLoRA training and results
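The sample-building and JSON-export steps in the syllabus could look roughly like the sketch below. It is an illustration under stated assumptions, not the method used in the video: each comment is assumed to be a dict with `id`, `parent_id`, `body`, and `score`, where a `parent_id` starting with `t1_` points at another comment. The table layout, the `min_score` filter, and the `instruction`/`output` keys are hypothetical choices.

```python
import json
import sqlite3


def build_samples(comments, db_path=":memory:", min_score=2):
    """Pair each reply with its parent comment and store the pairs in SQLite."""
    by_id = {c["id"]: c for c in comments}
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples ("
        "parent_id TEXT, reply_id TEXT PRIMARY KEY, "
        "prompt TEXT, response TEXT, score INTEGER)"
    )
    for c in comments:
        parent_ref = c.get("parent_id", "")
        # Only replies to other comments ("t1_" prefix) form a parent/reply pair.
        if not parent_ref.startswith("t1_"):
            continue
        parent = by_id.get(parent_ref[3:])
        score = int(c.get("score", 0))
        if parent is None or score < min_score:
            continue  # drop orphans and low-scoring replies
        conn.execute(
            "INSERT OR REPLACE INTO samples VALUES (?, ?, ?, ?, ?)",
            (parent["id"], c["id"], parent["body"], c["body"], score),
        )
    conn.commit()
    return conn


def export_json(conn, out_path):
    """Write the stored pairs to a JSON training file, highest-scoring replies first."""
    rows = conn.execute(
        "SELECT prompt, response FROM samples ORDER BY score DESC"
    ).fetchall()
    samples = [{"instruction": p, "output": r} for p, r in rows]
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump(samples, fh, ensure_ascii=False, indent=2)
    return samples
```

Keeping the pairs in a database before exporting makes it possible to regenerate the training file with different filters (for example a higher `min_score`) without re-reading the raw archives.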
Taught by
sentdex
Related Courses
- Serverless Data Analysis with Google BigQuery and Cloud Dataflow en Français (Google Cloud via Coursera)
- Google Cloud Big Data and Machine Learning Fundamentals en Español (Google Cloud via Coursera)
- Google Cloud Big Data and Machine Learning Fundamentals 日本語版 (Google Cloud via Coursera)
- Industrial IoT on Google Cloud (Google Cloud via Coursera)
- Google Cloud Platform Big Data and Machine Learning Fundamentals em Português Brasileiro (Google Cloud via Coursera)