Versioning, Syncing & Streaming Large Datasets Using DAT + Node
Offered By: JSConf via YouTube
Course Description
Syllabus
Intro
dat is an open source tool for sharing and collaborating on data
analogy time: let's talk about source control
life before git
1. somehow get a zip of cool-project 2. unpack and edit a file 3. email the file back 4. ????
maintainer creates new zip of cool-project that might contain my fix
claim: currently data sharing is a mess
email csv files
database dumps in git
we want to do for data what git did for source code
npm install -g dat
max, import your genome into dat
data is stored locally in leveldb; blobs are stored in blob-stores
choose the blob store that fits your use case: s3, local-fs
auto schema generation - free REST API - *all* APIs are streaming
a data set we can all relate to
calculate how big npm is using dat
dat cat | transform
dat cat | docker run -i transform
transform the npm data using bulk-markdown-to-png
use case: trillian astronomical
1. full sky scans 2. detect objects
problems: huge files, weird format
1TB gzipped CSVs, 600 million objects, 300 columns, 40TB imagery
data pipelines, dependency management, data streaming
gasket is a cross platform pipeline manager
datscript is an experimental pipeline config language
the future
branches, dat checkout 3b2d98V3, multi master replication, sync to databases, registry
Taught by
JSConf