PySpark Best Practices

Offered By: Open Data Science via YouTube

Tags

PySpark Courses, Data Analysis Courses, Python Courses, Apache Spark Courses, Distributed Computing Courses

Course Description

Overview

Learn best practices for using PySpark in real-world applications through this conference talk from ODSC West 2015. Discover how to manage dependencies on a cluster, avoid common pitfalls of Python's duck typing, and understand Spark's computational model for effective distributed code execution. Explore techniques for package management with virtualenv, testing PySpark applications, and structuring code for optimal performance. Gain insights into handling complex dependencies, implementing proper logging, and navigating multiple Python environments. Follow along with a practical example of a statistical analysis on time series data to reinforce key concepts and improve your PySpark development skills.

Syllabus

Cloudera
Spark Execution Model
PySpark Driver Program
How do we ship around Python functions?
Pickle!
DataFrame is just another word for...
Use DataFrames
REPLs and Notebooks
Share your code
Standard Python Project
What is the shape of a PySpark job?
PySpark Structure?
Simple Main Method
Write Testable Code
Write Serializable Code
Testing with SparkTestingBase
Testing Suggestions
Writing distributed code is the easy part...
Get Serious About Logs
Know your environment
Complex Dependencies
Many Python Environments
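
Several syllabus items (shipping Python functions with pickle, writing testable code, writing serializable code) point toward one habit: keeping per-record logic in plain module-level functions. The sketch below is not taken from the talk. It is a minimal illustration using only the standard library, and the names `parse_reading` and `round_trips` are hypothetical. It shows that the standard `pickle` module serializes a module-level function by reference but rejects a lambda. PySpark itself ships functions with cloudpickle, which can handle lambdas, but plain named functions remain the easier ones to unit test without a cluster.

```python
import pickle


def parse_reading(line):
    """Parse a 'timestamp,value' line into (timestamp, float value).

    Hypothetical per-record function of the kind one might pass to
    rdd.map(); it has no dependency on Spark itself.
    """
    timestamp, value = line.split(",")
    return timestamp, float(value)


def round_trips(func):
    """Return True if the standard pickle module can serialize func."""
    try:
        pickle.loads(pickle.dumps(func))
        return True
    except (pickle.PicklingError, AttributeError):
        return False


# A module-level function is pickled by reference (module name + function name).
plain_ok = round_trips(parse_reading)

# A lambda has no importable name, so the standard pickle module rejects it.
lambda_ok = round_trips(lambda line: line.split(","))

# The same function can be exercised directly, with no SparkContext involved.
sample = parse_reading("2015-11-05T00:00:00,3.5")
```

Because `parse_reading` never touches Spark, an ordinary unit test can cover it, and a Spark-backed test only needs to check the wiring around it.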


Taught by

Open Data Science

Related Courses

Artificial Intelligence for Robotics (Stanford University via Udacity)
Intro to Computer Science (University of Virginia via Udacity)
Design of Computer Programs (Stanford University via Udacity)
Web Development (Udacity)
Programming Languages (University of Virginia via Udacity)