Data Preparation Tips and Tricks for Machine Learning
Offered By: Trelis Research via YouTube
Course Description
Overview
This one-hour video tutorial covers data preparation techniques for machine learning. It walks through filtering and deduplication using FineWeb, balancing the concepts in a dataset with hierarchical k-means clustering, and a live demonstration of dataset balancing using OpenAssistant. It also covers handling labeled data, setting a chat template for a tokenizer that lacks one, hallucinations on novel data, and training on mixed-language datasets, along with text classification models, extracting structured data from PDFs, multi-GPU training, and RAG pipelines. Additional resources and a Colab notebook accompany the video.
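As a rough illustration of two ideas named in the overview (deduplication and dataset balancing), here is a minimal Python sketch. It is not taken from the video or its Colab notebook. The function names `normalize`, `dedupe_exact`, and `balance_by_cluster` are hypothetical, the deduplication is exact-match only after light normalization, and the cluster labels are assumed to be supplied already (for example by a k-means step that is not reproduced here).

```python
import hashlib
import random
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return " ".join(text.lower().split())


def dedupe_exact(texts):
    """Keep the first occurrence of each normalized text, preserving order."""
    seen = set()
    kept = []
    for text in texts:
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept


def balance_by_cluster(items, cluster_ids, per_cluster, seed=0):
    """Sample at most `per_cluster` items from each cluster label.

    `cluster_ids` is assumed to come from an earlier clustering step;
    this function only caps how many items each cluster contributes.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for item, cid in zip(items, cluster_ids):
        groups[cid].append(item)
    balanced = []
    for cid in sorted(groups):
        members = groups[cid]
        k = min(per_cluster, len(members))
        balanced.extend(rng.sample(members, k))
    return balanced
```

Capping each cluster at a fixed count is one simple way to keep over-represented topics from dominating a dataset; other sampling schemes are equally plausible.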
Syllabus
Welcome
FineWeb
Clustering and balancing data - Meta Paper
Clustering analysis in Colab
How to prepare chat / Q&A datasets synthetically
Q&A
Handling labeled data for fine-tuning
Setting a chat template for a tokenizer without one
Considerations on novel data and hallucinations
Issues with tokenizer and chat template not aligning
Using mixed-language datasets and their impact on training
Recommendations for models suitable for text classification
Extracting structured data from PDFs and tables
Multi-GPU training considerations
Using the LLM2Vec method for embeddings
RAG pipeline suggestions
Taught by
Trelis Research
Related Courses
TensorFlow: Working with NLP
LinkedIn Learning
Introduction to Video Editing - Video Editing Tutorials
Great Learning via YouTube
HuggingFace Crash Course - Sentiment Analysis, Model Hub, Fine Tuning
Python Engineer via YouTube
GPT3 and Finetuning the Core Objective Functions - A Deep Dive
David Shapiro ~ AI via YouTube
How to Build a Q&A AI in Python - Open-Domain Question-Answering
James Briggs via YouTube