Versioning, Syncing & Streaming Large Datasets Using DAT + Node
Offered By: JSConf via YouTube
Course Description
Overview
Syllabus
Intro
dat is an open source tool for sharing and collaborating on data
analogy time: let's talk about source control
life before git
1. somehow get a zip of cool-project; 2. unpack and edit a file; 3. email the file back; 4. ????
maintainer creates new zip of cool-project that might contain my fix
claim: currently data sharing is a mess
email csv files
database dumps in git
we want to do for data what git did for source code
npm install -g dat
max, import your genome into dat
data is stored locally in leveldb; blobs are stored in blob-stores
choose the blob store that fits your use case: s3, local-fs
auto schema generation - free REST API - *all* APIs are streaming
a data set we can all relate to
calculate how big npm is using dat
dat cat | transform
dat cat | docker run -i transform
transform the npm data using bulk-markdown-to-png
use case: trillian astronomical
1. full sky scans; 2. detect objects
problems: huge files, weird format
1TB gzipped CSVs; 600 million objects, 300 columns; 40TB imagery
data pipelines, dependency management, data streaming
gasket is a cross platform pipeline manager
datscript is an experimental pipeline config language
the future
branches, dat checkout 3b2d98V3, multi master replication, sync to databases, registry
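The "calculate how big npm is" and "dat cat transform" items in the syllabus suggest a streaming filter that reads rows and reduces them to a total. Below is a minimal Node sketch of that idea, not the code shown in the talk. It assumes the input is newline-delimited JSON and that each row carries a numeric `size` field. Both the format and the field name are hypothetical, as are the function names `parseLine`, `totalSize`, and `sumFromStream`.

```javascript
// Sketch only: sum a numeric field over newline-delimited JSON rows.
// Assumptions (not from the talk): one JSON object per line, and a
// hypothetical numeric `size` field on each row.
const readline = require('readline');

// Parse one line; return null for blank or malformed lines so that a
// single bad row does not stop the whole stream.
function parseLine(line) {
  const trimmed = line.trim();
  if (trimmed === '') return null;
  try {
    return JSON.parse(trimmed);
  } catch (err) {
    return null;
  }
}

// Sum `size` over any iterable or async iterable of lines.
// Rows are handled one at a time, so memory use stays flat.
async function totalSize(lines) {
  let total = 0;
  for await (const line of lines) {
    const row = parseLine(line);
    if (row && typeof row.size === 'number') total += row.size;
  }
  return total;
}

// Wrap a readable stream (for example process.stdin) in a line reader
// and sum it. readline interfaces are async iterable.
function sumFromStream(input) {
  const rl = readline.createInterface({ input, crlfDelay: Infinity });
  return totalSize(rl);
}
```

To use this as a pipeline stage, one could add `sumFromStream(process.stdin).then(console.log)` at the bottom of the file and pipe rows into it from the shell. Whether a given `dat` version emits rows in this shape is an assumption to check against its documentation.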
Taught by
JSConf