Versioning, Syncing & Streaming Large Datasets Using DAT + Node
Offered By: JSConf via YouTube
Course Description
Overview
Syllabus
Intro
dat is an open source tool for sharing and collaborating on data
analogy time: let's talk about source control
life before git
1. somehow get a zip of cool-project 2. unpack and edit a file 3. email the file back 4. ????
maintainer creates new zip of cool-project that might contain my fix
claim: currently data sharing is a mess
email csv files
database dumps in git
we want to do for data what git did for source code
npm install -g dat
max, import your genome into dat
data is stored locally in leveldb; blobs are stored in blob-stores
choose the blob store that fits your use case: s3, local-fs
auto schema generation - free REST API - *all* APIs are streaming
a data set we can all relate to
calculate how big npm is using dat
dat cat transform
dat cat docker run -i transform
transform the npm data using bulk-markdown-to-png
use case: trillian astronomical
1. full sky scans 2. detect objects
problems: huge files, weird format
1TB gzipped CSVs, 600 million objects, 300 columns, 40TB imagery
data pipelines, dependency management, data streaming
gasket is a cross platform pipeline manager
datscript is an experimental pipeline config language
the future
branches, dat checkout 3b2d98V3, multi master replication, sync to databases, registry
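The syllabus item "calculate how big npm is using dat" describes streaming records through a transform that adds up sizes. The following is a minimal Node sketch of that idea only. It does not use dat's own API. It assumes the records arrive as newline-delimited JSON and that each record carries a numeric `size` field; the field name and the function names `sizeOfLine`, `totalSize`, and `totalSizeFromStream` are hypothetical.

```javascript
'use strict';
const readline = require('readline');

// Parse one newline-delimited JSON line and return its `size` field.
// `size` is a hypothetical field name; blank lines and records
// without a numeric size count as 0.
function sizeOfLine(line) {
  const trimmed = line.trim();
  if (trimmed === '') return 0;
  const record = JSON.parse(trimmed);
  return typeof record.size === 'number' ? record.size : 0;
}

// Sum sizes over any iterable of lines (array, generator, ...).
function totalSize(lines) {
  let total = 0;
  for (const line of lines) total += sizeOfLine(line);
  return total;
}

// Streaming variant: reads a readable stream line by line, so the
// whole data set never has to fit in memory.
async function totalSizeFromStream(input) {
  const rl = readline.createInterface({ input, crlfDelay: Infinity });
  let total = 0;
  for await (const line of rl) total += sizeOfLine(line);
  return total;
}

// Small in-memory demo with made-up records.
console.log(totalSize([
  '{"name":"a","size":10}',
  '{"name":"b","size":32}'
])); // prints 42
```

Passing `process.stdin` to `totalSizeFromStream` would let the same logic sit at the end of a shell pipeline, which is the shape the streaming items in the syllabus point at.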
Taught by
JSConf