Data Preparation Tips and Tricks for Machine Learning

Offered By: Trelis Research via YouTube

Tags

Data Preparation Courses
Machine Learning Courses
Clustering Courses
Fine-Tuning Courses

Course Description

Overview

Explore data preparation techniques for machine learning in this one-hour video tutorial. Learn about filtering and deduplication using FineWeb and about balancing concepts with hierarchical k-means filtering, then see a live demonstration of dataset balancing using OpenAssistant. Further topics include handling labeled data, setting chat templates for tokenizers, addressing hallucinations, and working with mixed-language datasets, along with text classification models, extracting structured data from PDFs, multi-GPU training, and RAG pipelines. Additional resources and a Colab notebook accompany the video.

Syllabus

Welcome
FineWeb
Clustering and balancing data - Meta Paper
Clustering analysis in Colab
How to prepare chat / Q&A datasets synthetically
Q&A
Handling labeled data for fine-tuning
Setting a chat template for a tokenizer without one
Considerations on novel data and hallucinations
Issues with tokenizer and chat template not aligning
Using mixed-language datasets and their impact on training
Recommendations for models suitable for text classification
Extracting structured data from PDFs and tables
Multi-GPU training considerations
Using the LLM2Vec method for embeddings
RAG pipeline suggestions


Taught by

Trelis Research

Related Courses

Passion Driven Statistics
Wesleyan University via Coursera
Machine Learning With Big Data
University of California, San Diego via Coursera
Big Data - Capstone Project
University of California, San Diego via Coursera
Data Science at Scale - Capstone Project
University of Washington via Coursera
Data Analysis: Final Project (Анализ данных: финальный проект)
Moscow Institute of Physics and Technology via Coursera