YoVDO

History and Evolution of Data Lake Architecture - Post Lambda Architecture

Offered By: Linux Foundation via YouTube

Tags

Hadoop Courses Distributed Computing Courses Real-Time Analytics Courses Lambda Architecture Courses Delta Lake Courses Apache Hudi Courses

Course Description

Overview

Explore the history and evolution of data lake architecture in this 49-minute Linux Foundation conference talk. Delve into the post-Lambda Architecture era with speakers Takuya Fukuhisa and Masaru Dobashi from NTT DATA. Trace the development from Hadoop's distributed computing capabilities to the emergence of technologies enhancing application productivity. Examine the progression of data processing, including SQL on Hadoop, column-oriented formats, and real-time analytics requirements. Analyze the challenges of batch and stream-focused architectures, as well as the limitations of Lambda Architecture. Gain insights into modern storage layer solutions like Delta Lake, Apache Hudi, and Apache Iceberg, comparing their approaches to transaction management, version control, and balancing trade-offs in data lake design.

Syllabus

Intro
Hadoop enabled enterprises to store and process huge volumes of data with distributed computing on commodity hardware.
However, it was troublesome to write MapReduce applications directly, so many technologies were born to increase application productivity by abstracting MapReduce-based distributed computing.
Data processing with low latency
HBase added the ability to handle small pieces of data to the Hadoop ecosystem.
SQL on Hadoop: after distributed computing became popular, various SQL-on-Hadoop engines were developed.
Column-oriented formats became known as a technology for DWH systems, and the Hadoop ecosystem also uses these kinds of formats.
Traditional requirements for the storage layer: traditional requirements for Hadoop, such as scalability, will continue to be required.
Use case example that requires real-time analytics: by analyzing the latest activity together with the accumulated history, it is possible to deliver useful information to users and store it in real time. 1. Accumulate data in advance by batch and stream input
What are the problems with the "real-time analytics" architecture? Batch- and stream-focused architectures make it difficult to meet both real-time and diverse analytical requirements.
What are the problems with Lambda Architecture? Lambda Architecture integrates batch and stream pipelines, which makes it difficult to ensure integrity and increases the costs associated with pipeline complexity.
Overview of Delta Lake: storage with transaction management and version control
Apache Hudi vs. Apache Iceberg and Delta Lake: each product has devised its own reading method while achieving high-speed writes in a simple way.
Apache Iceberg and Delta Lake handle management information in different file structures.
Considerations about trade-offs: each recent storage layer has taken various approaches toward balancing the trade-offs.
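The Lambda Architecture problem described in the syllabus can be sketched in a few lines. This is a hypothetical, minimal pure-Python model (not from the talk): a batch layer recomputes a view over the full dataset, a speed layer maintains a view over recent events, and the serving layer merges both at query time. The duplicated counting logic across the two layers illustrates the integrity and complexity cost the talk points out.

```python
# Toy model of Lambda Architecture's two pipelines (illustrative only; real
# systems would use e.g. a batch engine plus a separate stream processor).
from collections import Counter

def batch_view(master_dataset):
    # Batch layer: recompute the view from the full, immutable master dataset.
    return Counter(event["user"] for event in master_dataset)

def speed_view(recent_events):
    # Speed layer: the SAME counting logic, reimplemented for unbatched events.
    return Counter(event["user"] for event in recent_events)

def query(user, batch, speed):
    # Serving layer: merge both views at query time. Keeping the two layers
    # consistent is the integrity/cost problem the talk describes.
    return batch.get(user, 0) + speed.get(user, 0)

master = [{"user": "a"}, {"user": "b"}, {"user": "a"}]
recent = [{"user": "a"}]
print(query("a", batch_view(master), speed_view(recent)))  # 3
```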
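The transaction-management and version-control idea behind Delta Lake-style storage layers can also be sketched: each commit is an immutable, numbered log entry, and replaying the log up to a given version reconstructs any past state of the table. This toy model is an assumption-laden illustration, not the actual Delta Lake `_delta_log` format.

```python
# Toy append-only transaction log in the spirit of Delta Lake-style storage
# layers; NOT the real _delta_log protocol or file layout.
import json

class ToyTableLog:
    def __init__(self):
        self.log = []  # append-only list of JSON commit records

    def commit(self, added_files):
        # A commit atomically records which data files were added, and its
        # position in the log is the table version.
        self.log.append(json.dumps({"version": len(self.log), "add": added_files}))
        return len(self.log) - 1

    def snapshot(self, version=None):
        # Replay the log up to `version` to list the table's live files
        # (version control / "time travel" in miniature).
        upto = len(self.log) if version is None else version + 1
        files = []
        for entry in self.log[:upto]:
            files.extend(json.loads(entry)["add"])
        return files

log = ToyTableLog()
log.commit(["part-0.parquet"])
log.commit(["part-1.parquet"])
print(log.snapshot(version=0))  # ['part-0.parquet']
print(log.snapshot())           # ['part-0.parquet', 'part-1.parquet']
```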
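The "high-speed writes with a devised reading method" point attributed to products like Apache Hudi can likewise be illustrated with a merge-on-read toy: writes cheaply append deltas, reads merge deltas over a base copy, and background compaction rebalances the trade-off. The class name and structure here are invented for illustration and do not reflect Hudi's actual file format.

```python
# Toy sketch of the "fast write, merge at read" (merge-on-read) trade-off;
# invented for illustration, not Apache Hudi's real table layout.
class ToyMergeOnRead:
    def __init__(self):
        self.base = {}     # compacted base "file": key -> latest value
        self.deltas = []   # cheap appended updates, merged only when reading

    def write(self, key, value):
        # Writes are fast: just append; the base data is never rewritten.
        self.deltas.append((key, value))

    def read(self):
        # Reads pay the cost: replay deltas over the base to get current state.
        merged = dict(self.base)
        for key, value in self.deltas:
            merged[key] = value
        return merged

    def compact(self):
        # Background compaction folds deltas into the base, speeding up reads.
        self.base = self.read()
        self.deltas = []

t = ToyMergeOnRead()
t.write("user1", "clicked")
t.write("user1", "purchased")
print(t.read())  # {'user1': 'purchased'}
```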


Taught by

Linux Foundation

Related Courses

Transform Your Machine Learning Pipelines with Apache Hudi
Linux Foundation via YouTube
Delivering Portability to Open Data Lakes with Delta Lake UniForm
Databricks via YouTube
Fast Copy-On-Write in Apache Parquet for Data Lakehouse Upserts
Databricks via YouTube
Apache XTable - Interoperability Among Lakehouse Table Formats
Databricks via YouTube
How to Migrate from Snowflake to an Open Data Lakehouse Using Delta Lake UniForm
Databricks via YouTube