Introduction to Data Versioning with DVC
Offered By: DataCamp
Course Description
Overview
Explore Data Version Control for ML data management. Master setup, automate pipelines, and evaluate models seamlessly.
Delve into Data Version Control (DVC), a tool for managing and versioning ML data. Explore its role in the ML lifecycle, differentiate data versioning from code versioning, and examine DVC’s features and use cases. Learn about DVC setup, including cache management and remotes, and discover its applications in CI/CD, experiment tracking, and pipelines. Automate ML pipelines, emphasizing code modularization, and practice executing them efficiently. Conclude with model evaluation, exploring metric tracking in DVC for informed decision-making.
Syllabus
- Introduction to DVC
- This chapter provides a comprehensive introduction to Data Version Control (DVC), a tool essential for data versioning in machine learning. Learners will explore the motivation behind data versioning, understand its differences from code versioning, and experiment with a simple classification problem. They will review basic Git commands, learn about DVC, and practice setting up a repository. The chapter concludes with an overview of DVC’s features and use cases, including versioning data and models, CI/CD for machine learning, experiment tracking, pipelines, and more.
- DVC Configuration and Data Management
- This chapter covers DVC setup: installation, repository initialization, and use of the .dvcignore file. It then explores the DVC cache and staging files, showing how to add and remove files, manage caches, and understand the underlying mechanism based on MD5 hashes. The chapter also explains DVC remotes, distinguishes them from Git remotes, and shows how to add, list, and modify them. Finally, it covers interacting with remotes by pushing and pulling data, checking out specific versions, and fetching data to the cache.
- Pipelines in DVC
- This chapter focuses on automating ML pipelines using DVC. Learners create a configuration file containing settings and hyperparameters. They also learn to visualize pipelines as directed acyclic graphs and use DVC commands to define the dependencies, commands, and outputs of each stage. Execution of DVC pipelines is covered, including local model training and how Git tracks DVC metadata. Additionally, learners explore metrics and plots tracking in DVC, including how to print metrics, create plot files, and compare metrics and plots across different pipeline stages.
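The second chapter mentions that the DVC cache relies on MD5 hashes. As a rough illustration of that idea, the following Python sketch stores data under a key derived from the MD5 digest of its content. The function names (`content_address`, `store`), the in-memory dictionary standing in for a cache directory, and the two-character prefix layout are assumptions made for this example; they are not DVC's actual implementation or API.

```python
import hashlib
from pathlib import PurePosixPath


def content_address(data: bytes) -> tuple[str, PurePosixPath]:
    """Return the MD5 hex digest of the data and a cache-style relative path.

    The path layout (first two hex characters as a directory) is an
    assumption for illustration only.
    """
    digest = hashlib.md5(data).hexdigest()
    return digest, PurePosixPath(digest[:2]) / digest[2:]


def store(cache: dict, data: bytes) -> str:
    """Store data in a hypothetical in-memory cache keyed by content hash."""
    digest, path = content_address(data)
    # Identical content maps to the same key, so it is stored only once.
    cache.setdefault(str(path), data)
    return digest


cache = {}
v1 = store(cache, b"label,value\ncat,1\n")
v2 = store(cache, b"label,value\ncat,1\ndog,2\n")
v1_again = store(cache, b"label,value\ncat,1\n")

print(v1 == v1_again)  # same content yields the same digest
print(v1 != v2)        # changed content yields a different digest
print(len(cache))      # number of distinct versions stored
```

Because the key depends only on the content, re-adding an unchanged file costs no extra storage, while any edit produces a new entry that can be referenced as a separate version.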
Taught by
Ravi Bhadauria
Related Courses
- Introduction to Jenkins (Linux Foundation via edX)
- Introduction to Cloud Native, DevOps, Agile, and NoSQL (IBM via edX)
- Learn Azure DevOps CI/CD pipelines (Udemy)
- IBM Full Stack Software Developer (IBM via Coursera)
- DevOps: CI/CD with Jenkins pipelines, Maven, Gradle (Udemy)