
PySpark Best Practices

Offered By: Open Data Science via YouTube

Tags

PySpark Courses, Data Analysis Courses, Python Courses, Apache Spark Courses, Distributed Computing Courses

Course Description

Overview

Learn best practices for using PySpark in real-world applications through this conference talk from ODSC West 2015. Discover how to manage dependencies on a cluster, avoid common pitfalls of Python's duck typing, and understand Spark's computational model for effective distributed code execution. Explore techniques for package management with virtualenv, testing PySpark applications, and structuring code for optimal performance. Gain insights into handling complex dependencies, implementing proper logging, and navigating multiple Python environments. Follow along with a practical example of a statistical analysis on time series data to reinforce key concepts and improve your PySpark development skills.
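
As a rough illustration of the "structure your code" and "write testable code" ideas mentioned above, here is a minimal sketch, not taken from the talk itself: it uses the modern SparkSession API rather than the 2015-era SparkContext, and the function name, column names, and paths are hypothetical. The point is simply to keep transformation logic in small, serializable functions and confine session setup to a single entry point.

```python
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F


def summarize_by_day(events: DataFrame) -> DataFrame:
    """Aggregate a time series of (day, value) rows into per-day statistics."""
    return (
        events.groupBy("day")
        .agg(
            F.avg("value").alias("mean_value"),
            F.stddev("value").alias("stddev_value"),
        )
    )


def main() -> None:
    # Keep SparkSession construction at the edge of the program so the
    # transformation above stays easy to unit test and to serialize.
    spark = SparkSession.builder.appName("timeseries-summary").getOrCreate()
    events = spark.read.parquet("hdfs:///data/events")  # illustrative input path
    summarize_by_day(events).write.parquet("hdfs:///data/daily_summary")
    spark.stop()


if __name__ == "__main__":
    main()
```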

Syllabus

Cloudera
Spark Execution Model
PySpark Driver Program
How do we ship around Python functions?
Pickle!
DataFrame is just another word for...
Use DataFrames
REPLs and Notebooks
Share your code
Standard Python Project
What is the shape of a PySpark job?
PySpark Structure?
Simple Main Method
Write Testable Code
Write Serializable Code
Testing with SparkTestingBase (see the test sketch after this syllabus)
Testing Suggestions
Writing distributed code is the easy part...
Get Serious About Logs
Know your environment
Complex Dependencies
Many Python Environments
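
To illustrate the "Write Testable Code", "Testing with SparkTestingBase", and "Testing Suggestions" items above, here is a minimal sketch of a local PySpark unit test. It does not use the spark-testing-base package mentioned in the talk; instead it spins up a plain local SparkSession with unittest so the example stays self-contained. The transformation under test and all names are hypothetical.

```python
import unittest

from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F


def summarize_by_day(events: DataFrame) -> DataFrame:
    # Hypothetical transformation under test (same shape as the earlier sketch).
    return events.groupBy("day").agg(F.avg("value").alias("mean_value"))


class SummarizeByDayTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # A small local session is enough to exercise the distributed code path.
        cls.spark = (
            SparkSession.builder.master("local[2]")
            .appName("pyspark-tests")
            .getOrCreate()
        )

    @classmethod
    def tearDownClass(cls):
        cls.spark.stop()

    def test_mean_per_day(self):
        events = self.spark.createDataFrame(
            [("2015-11-01", 2.0), ("2015-11-01", 4.0), ("2015-11-02", 10.0)],
            ["day", "value"],
        )
        result = {row["day"]: row["mean_value"]
                  for row in summarize_by_day(events).collect()}
        self.assertAlmostEqual(result["2015-11-01"], 3.0)
        self.assertAlmostEqual(result["2015-11-02"], 10.0)


if __name__ == "__main__":
    unittest.main()
```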


Taught by

Open Data Science

Related Courses

Fundamentals of Scalable Data Science
IBM via Coursera
Data Science and Engineering with Spark
University of California, Berkeley via edX
Master of Machine Learning and Data Science
Imperial College London via Coursera
Data Analysis Using Pyspark
Coursera Project Network via Coursera
Building Machine Learning Pipelines in PySpark MLlib
Coursera Project Network via Coursera