Data Preparation Tips and Tricks for Machine Learning
Offered By: Trelis Research via YouTube
Course Description
Overview
Explore data preparation techniques for machine learning in this comprehensive one-hour video tutorial. Learn about filtering and deduplication using FineWeb and about balancing datasets with hierarchical k-means clustering, then see a live demonstration of dataset balancing using OpenAssistant. Dive into topics such as handling labeled data, setting chat templates for tokenizers, addressing hallucinations, and working with mixed-language datasets. Gain insights on text classification models, extracting structured data from PDFs and tables, multi-GPU training, and implementing RAG pipelines. Access additional resources and a Colab notebook to deepen your understanding of data preparation strategies for better machine learning outcomes.
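To make the clustering-based balancing idea concrete, the sketch below shows one simple way to cap the number of samples kept per k-means cluster of document embeddings. It uses flat k-means rather than the hierarchical variant discussed in the Meta paper, and the embedding model, cluster count, and per-cluster cap are illustrative placeholders rather than values from the video.

```python
# Minimal sketch: balance a text dataset by capping samples per k-means cluster.
# Assumes scikit-learn and sentence-transformers are installed; the model name,
# cluster count, and per-cluster cap are illustrative choices only.
from collections import defaultdict

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

texts = [
    "How do I fine-tune a language model?",
    "Best pasta recipe for a quick dinner",
    "What is gradient accumulation?",
    "Top ten travel destinations in Europe",
    # ... in practice, thousands of documents
]

# 1. Embed each document.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(texts)

# 2. Cluster the embeddings (the cluster count is a hyperparameter to tune).
labels = KMeans(n_clusters=2, random_state=0).fit_predict(embeddings)

# 3. Keep at most `cap` examples per cluster so no topic dominates.
cap = 2
kept, per_cluster = [], defaultdict(int)
for text, label in zip(texts, labels):
    if per_cluster[label] < cap:
        kept.append(text)
        per_cluster[label] += 1

print(f"Kept {len(kept)} of {len(texts)} documents")
```

Capping per-cluster counts keeps over-represented topics from dominating the training mix; the hierarchical variant applies the same idea recursively within each cluster.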
Syllabus
Welcome
FineWeb
Clustering and balancing data - Meta Paper
Clustering analysis in Colab
How to prepare chat / Q&A datasets synthetically
Q&A
Handling labeled data for fine-tuning
Setting a chat template for a tokenizer without one (see the sketch after this syllabus)
Considerations on novel data and hallucinations
Issues with tokenizer and chat template not aligning
Using mixed-language datasets and their impact on training
Recommendations for models suitable for text classification
Extracting structured data from PDFs and tables
Multi-GPU training considerations
Using the LLM2Vec method for embeddings
RAG pipeline suggestions
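As a companion to the chat-template item above, the sketch below shows the common Hugging Face pattern of assigning a Jinja template string to tokenizer.chat_template when a model ships without one. The base model and the ChatML-style format are assumptions for illustration, not necessarily what the video uses.

```python
# Minimal sketch: attach a ChatML-style chat template to a tokenizer that lacks one.
# The base model name and template format are illustrative assumptions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # example of a model with no chat template

# Jinja template consumed by tokenizer.apply_chat_template()
tokenizer.chat_template = (
    "{% for message in messages %}"
    "<|im_start|>{{ message['role'] }}\n{{ message['content'] }}<|im_end|>\n"
    "{% endfor %}"
    "{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
)

messages = [{"role": "user", "content": "What is data deduplication?"}]

# Render the prompt string without tokenizing, ready for inspection or dataset prep.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```

In practice you would also add any new special tokens (such as <|im_start|>) to the tokenizer and resize the model embeddings before fine-tuning.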
Taught by
Trelis Research
Related Courses
Graph Partitioning and Expanders (Stanford University via NovoEd)
The Analytics Edge (Massachusetts Institute of Technology via edX)
More Data Mining with Weka (University of Waikato via Independent)
Mining Massive Datasets (Stanford University via edX)
The Caltech-JPL Summer School on Big Data Analytics (California Institute of Technology via Coursera)