Koalas: Scaling Pandas APIs on Apache Spark - Performance and Comparison with Dask

Offered By: Databricks via YouTube

Tags

Apache Spark Courses, Data Science Courses, Big Data Courses, pandas Courses, Data Manipulation Courses, Benchmarking Courses, Dask Courses

Course Description

Overview

This 24-minute talk from Databricks covers the capabilities and performance of Koalas, an open-source project that provides pandas APIs on top of Apache Spark. It explains how Koalas bridges the gap between pandas' data science functionality and Apache Spark's scalability for big data, and compares Koalas with other pandas-scaling libraries, particularly Dask, through benchmarks. The talk also covers Koalas's internal frame, execution times, the influence of Spark's Catalyst optimizer and code generation on the results, and the recent updates and main changes in Koalas.

Syllabus

Introduction
What is Koalas
Internal Frame
Benchmark
Results
Execution Time
Influence of Catalyst
Code Generation
Benchmark Results
What's New
Main Changes


Taught by

Databricks

Related Courses

CS115x: Advanced Apache Spark for Data Science and Data Engineering (University of California, Berkeley via edX)
Big Data Analytics (University of Adelaide via edX)
Big Data Essentials: HDFS, MapReduce and Spark RDD (Yandex via Coursera)
Big Data Analysis: Hive, Spark SQL, DataFrames and GraphFrames (Yandex via Coursera)
Introduction to Apache Spark and AWS (University of London International Programmes via Coursera)