Spark-Syntax
This is a public repo documenting the "best practices" for writing PySpark code that I have learned from working with PySpark for 3 years. It focuses mainly on the Spark DataFrame and SQL APIs.
You can also visit ericxiao251.github.io/spark-syntax/ for an online book version.
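For a quick taste of the style of code covered in the book, here is a minimal sketch (assuming a local `pyspark` installation); the column names and values are made up purely for illustration:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# Start (or reuse) a local Spark session.
spark = SparkSession.builder.appName("spark-syntax-example").getOrCreate()

# A tiny DataFrame with a null value to transform.
df = spark.createDataFrame(
    [("alice", 1), ("bob", None)],
    ["name", "score"],
)

# A few of the DataFrame transformations covered in Chapter 2:
# fill nulls, add a constant column, and filter.
(
    df
    .withColumn("score", F.coalesce(F.col("score"), F.lit(0)))
    .withColumn("source", F.lit("example"))
    .where(F.col("name").isNotNull())
    .show()
)
```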
Contributing/Topic Requests
If you notice any improvements that could be made (typos, spelling, grammar, etc.), feel free to create a PR and I'll review it.
If you have any topics you would like me to cover, please create an issue describing the topic and I'll do my best to address it.
Acknowledgement
Huge thanks to Levon for turning everything into a GitBook. You can follow his GitHub at https://github.com/tumregels.
Table of Contents:
Chapter 1 - Getting Started with Spark:
- 1.1 - Useful Material
- 1.2 - Creating your First DataFrame
- 1.3 - Reading your First Dataset
- 1.4 - More Comfortable with SQL?
Chapter 2 - Exploring the Spark APIs:
- 2.1 - Non-Trivial Data Structures in Spark
  - 2.1.1 - Struct Types (`StructType`)
  - 2.1.2 - Arrays and Lists (`ArrayType`)
  - 2.1.3 - Maps and Dictionaries (`MapType`)
  - 2.1.4 - Decimals and Why did my Decimals overflow :( (`DecimalType`)
- 2.2 - Performing your First Transformations
  - 2.2.1 - Looking at Your Data (`collect`/`head`/`take`/`first`/`toPandas`/`show`)
  - 2.2.2 - Selecting a Subset of Columns (`drop`/`select`)
  - 2.2.3 - Creating New Columns and Transforming Data (`withColumn`/`withColumnRenamed`)
  - 2.2.4 - Constant Values and Column Expressions (`lit`/`col`)
  - 2.2.5 - Casting Columns to a Different Type (`cast`)
  - 2.2.6 - Filtering Data (`where`/`filter`/`isin`)
  - 2.2.7 - Equality Statements in Spark and Comparisons with Nulls (`isNotNull()`/`isNull()`)
  - 2.2.8 - Case Statements (`when`/`otherwise`)
  - 2.2.9 - Filling in Null Values (`fillna`/`coalesce`)
  - 2.2.10 - Spark Functions aren't Enough, I Need my Own! (`udf`/`pandas_udf`)
  - 2.2.11 - Unioning Multiple DataFrames (`union`)
  - 2.2.12 - Performing Joins (clean one) (`join`)
- 2.3 - More Complex Transformations
  - 2.3.1 - One to Many Rows (`explode`)
  - 2.3.2 - Range Join Conditions (WIP) (`join`)
- 2.4 - Potential Performance Boosting Functions
  - 2.4.1 - `repartition`
  - 2.4.2 - `coalesce`
  - 2.4.3 - `cache`
  - 2.4.4 - `broadcast`
Chapter 3 - Aggregates:
- 3.1 - Clean Aggregations
- 3.2 - Non Deterministic Behaviours
Chapter 4 - Window Objects:
Chapter 5 - Error Logs:
Chapter 6 - Understanding Spark Performance:
- 6.1 - Primer to Understanding Your Spark Application
  - 6.1.1 - Understanding how Spark Works
  - 6.1.2 - Understanding the SparkUI
  - 6.1.3 - Understanding how the DAG is Created
  - 6.1.4 - Understanding how Memory is Allocated
- 6.2 - Analyzing your Spark Application
  - 6.2.1 - Looking for Skew in a Stage
  - 6.2.2 - Looking for Skew in the DAG
  - 6.2.3 - How to Determine the Number of Partitions to Use
- 6.3 - How to Analyze the Skew of Your Data
Chapter 7 - High Performance Code:
- 7.0 - The Types of Join Strategies in Spark
  - 7.0.1 - You got a Small Table? (Broadcast Join)
  - 7.0.2 - The Ideal Strategy (`BroadcastHashJoin`)
  - 7.0.3 - The Default Strategy (`SortMergeJoin`)
- 7.1 - Improving Joins
  - 7.1.1 - Filter Pushdown
  - 7.1.2 - Joining on Skewed Data (Null Keys)
  - 7.1.3 - Joining on Skewed Data (High Frequency Keys I)
  - 7.1.4 - Joining on Skewed Data (High Frequency Keys II)
  - 7.1.5 - Join Ordering
- 7.2 - Repeated Work on a Single Dataset (caching)
  - 7.2.1 - Caching Layers
- 7.3 - Spark Parameters
  - 7.3.1 - Running Multiple Spark Applications at Scale (dynamic allocation)
  - 7.3.2 - The Magical Number 2001 (partitions)
  - 7.3.3 - Using a lot of UDFs? (Python memory)
- 7.4 - Bloom Filters :o?