YoVDO

Data Preparation Tips and Tricks for Machine Learning

Offered By: Trelis Research via YouTube

Tags

Data Preparation Courses
Machine Learning Courses
Clustering Courses
Fine-Tuning Courses

Course Description

Overview

Explore data preparation techniques for machine learning in this one-hour video tutorial. Learn about filtering and deduplication using FineWeb, see how hierarchical k-means filtering can balance the concepts in a dataset, and watch a live demonstration of dataset balancing using OpenAssistant. The Q&A covers handling labeled data, setting chat templates for tokenizers, addressing hallucinations, and working with mixed-language datasets, along with text classification models, extracting structured data from PDFs, multi-GPU training, and RAG pipelines. Additional resources and a Colab notebook accompany the video.

Syllabus

Welcome
FineWeb
Clustering and balancing data - Meta Paper
Clustering analysis in Colab
How to prepare chat / Q&A datasets synthetically
Q&A
Handling labeled data for fine-tuning
Setting a chat template for a tokenizer without one
Considerations on novel data and hallucinations
Issues with tokenizer and chat template not aligning
Using mixed-language datasets and their impact on training
Recommendations for models suitable for text classification
Extracting structured data from PDFs and tables
Multi-GPU training considerations
Using the LLM2Vec method for embeddings
RAG pipeline suggestions


Taught by

Trelis Research

Related Courses

Graph Partitioning and Expanders
Stanford University via NovoEd
The Analytics Edge
Massachusetts Institute of Technology via edX
More Data Mining with Weka
University of Waikato via Independent
Mining Massive Datasets
Stanford University via edX
The Caltech-JPL Summer School on Big Data Analytics
California Institute of Technology via Coursera