Data Preparation Tips and Tricks for Machine Learning
Offered By: Trelis Research via YouTube
Course Description
Overview
Explore data preparation techniques for machine learning in this comprehensive one-hour video tutorial. Learn about filtering and deduplication using FineWeb, balance concepts with hierarchical k-means filtering, and see a live demonstration of dataset balancing using OpenAssistant. Dive into topics like handling labeled data, setting chat templates for tokenizers, addressing hallucinations, and working with mixed-language datasets. Gain insights on text classification models, extracting structured data from PDFs, multi-GPU training, and implementing RAG pipelines. Access additional resources and a Colab notebook to enhance your understanding of data preparation strategies for optimal machine learning outcomes.
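As a rough illustration of the deduplication and balancing ideas named in the overview, here is a minimal Python sketch. It is not the FineWeb pipeline or the method shown in the video: it assumes exact-match deduplication on normalized text, and it assumes cluster IDs have already been produced by some clustering step (for example k-means over embeddings). The function names `normalize`, `deduplicate`, and `balance_by_cluster` are hypothetical and were chosen for this example only.

```python
import hashlib
import random
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def deduplicate(texts):
    """Keep the first occurrence of each normalized text (exact-match only)."""
    seen = set()
    unique = []
    for text in texts:
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


def balance_by_cluster(texts, cluster_ids, max_per_cluster, seed=0):
    """Cap every cluster at max_per_cluster samples, picked at random."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for text, cid in zip(texts, cluster_ids):
        by_cluster[cid].append(text)
    balanced = []
    for cid in sorted(by_cluster):
        members = by_cluster[cid]
        if len(members) > max_per_cluster:
            members = rng.sample(members, max_per_cluster)
        balanced.extend(members)
    return balanced


# Toy data; the cluster IDs below are made up for illustration.
docs = ["Hello  world", "hello world", "Cats purr",
        "Dogs bark", "Dogs howl", "Dogs dig"]
unique_docs = deduplicate(docs)
toy_cluster_ids = [0, 1, 2, 2, 2]  # one ID per entry in unique_docs
balanced_docs = balance_by_cluster(unique_docs, toy_cluster_ids, max_per_cluster=2)
```

Capping each cluster is only one way to balance a dataset; near-duplicate detection and the hierarchical clustering covered in the video would need additional tooling that this sketch leaves out.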
Syllabus
Welcome
FineWeb
Clustering and balancing data - Meta Paper
Clustering analysis in Colab
How to prepare chat / Q&A datasets synthetically
Q&A
Handling labeled data for fine-tuning
Setting a chat template for a tokenizer without one
Considerations on novel data and hallucinations
Issues with tokenizer and chat template not aligning
Using mixed-language datasets and their impact on training
Recommendations for models suitable for text classification
Extracting structured data from PDFs and tables
Multi-GPU training considerations
Using the LLM2Vec method for embeddings
RAG pipeline suggestions
Taught by
Trelis Research
Related Courses
Passion Driven Statistics (Wesleyan University via Coursera)
Machine Learning With Big Data (University of California, San Diego via Coursera)
Big Data - Capstone Project (University of California, San Diego via Coursera)
Data Science at Scale - Capstone Project (University of Washington via Coursera)
Data Analysis: Final Project (Moscow Institute of Physics and Technology via Coursera)