• Stars
    star
    605
  • Rank 74,072 (Top 2 %)
  • Language
    Python
  • License
    MIT License
  • Created over 5 years ago
  • Updated about 1 month ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

PySpark test helper methods with beautiful error messages

chispa

image PyPI - Downloads PyPI version

chispa provides fast PySpark test helper methods that output descriptive error messages.

This library makes it easy to write high quality PySpark code.

Fun fact: "chispa" means Spark in Spanish ;)

Installation

Install the latest version with pip install chispa.

If you use Poetry, add this library as a development dependency with poetry add chispa -G dev.

Column equality

Suppose you have a function that removes the non-word characters in a string.

def remove_non_word_characters(col):
    return F.regexp_replace(col, "[^\\w\\s]+", "")

Create a SparkSession so you can create DataFrames.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
  .master("local")
  .appName("chispa")
  .getOrCreate())

Create a DataFrame with a column that contains strings with non-word characters, run the remove_non_word_characters function, and check that all these characters are removed with the chispa assert_column_equality method.

import pytest

from chispa.column_comparer import assert_column_equality
import pyspark.sql.functions as F

def test_remove_non_word_characters_short():
    data = [
        ("jo&&se", "jose"),
        ("**li**", "li"),
        ("#::luisa", "luisa"),
        (None, None)
    ]
    df = (spark.createDataFrame(data, ["name", "expected_name"])
        .withColumn("clean_name", remove_non_word_characters(F.col("name"))))
    assert_column_equality(df, "clean_name", "expected_name")

Let's write another test that'll fail to see how the descriptive error message lets you easily debug the underlying issue.

Here's the failing test:

def test_remove_non_word_characters_nice_error():
    data = [
        ("matt7", "matt"),
        ("bill&", "bill"),
        ("isabela*", "isabela"),
        (None, None)
    ]
    df = (spark.createDataFrame(data, ["name", "expected_name"])
        .withColumn("clean_name", remove_non_word_characters(F.col("name"))))
    assert_column_equality(df, "clean_name", "expected_name")

Here's the nicely formatted error message:

ColumnsNotEqualError

You can see the matt7 / matt row of data is what's causing the error (note it's highlighted in red). The other rows are colored blue because they're equal.

DataFrame equality

We can also test the remove_non_word_characters method by creating two DataFrames and verifying that they're equal.

Creating two DataFrames is slower and requires more code, but comparing entire DataFrames is necessary for some tests.

from chispa.dataframe_comparer import *

def test_remove_non_word_characters_long():
    source_data = [
        ("jo&&se",),
        ("**li**",),
        ("#::luisa",),
        (None,)
    ]
    source_df = spark.createDataFrame(source_data, ["name"])

    actual_df = source_df.withColumn(
        "clean_name",
        remove_non_word_characters(F.col("name"))
    )

    expected_data = [
        ("jo&&se", "jose"),
        ("**li**", "li"),
        ("#::luisa", "luisa"),
        (None, None)
    ]
    expected_df = spark.createDataFrame(expected_data, ["name", "clean_name"])

    assert_df_equality(actual_df, expected_df)

Let's write another test that'll return an error, so you can see the descriptive error message.

def test_remove_non_word_characters_long_error():
    source_data = [
        ("matt7",),
        ("bill&",),
        ("isabela*",),
        (None,)
    ]
    source_df = spark.createDataFrame(source_data, ["name"])

    actual_df = source_df.withColumn(
        "clean_name",
        remove_non_word_characters(F.col("name"))
    )

    expected_data = [
        ("matt7", "matt"),
        ("bill&", "bill"),
        ("isabela*", "isabela"),
        (None, None)
    ]
    expected_df = spark.createDataFrame(expected_data, ["name", "clean_name"])

    assert_df_equality(actual_df, expected_df)

Here's the nicely formatted error message:

DataFramesNotEqualError

Ignore row order

You can easily compare DataFrames, ignoring the order of the rows. The content of the DataFrames is usually what matters, not the order of the rows.

Here are the contents of df1:

+--------+
|some_num|
+--------+
|       1|
|       2|
|       3|
+--------+

Here are the contents of df2:

+--------+
|some_num|
+--------+
|       2|
|       1|
|       3|
+--------+

Here's how to confirm df1 and df2 are equal when the row order is ignored.

assert_df_equality(df1, df2, ignore_row_order=True)

If you don't specify to ignore_row_order then the test will error out with this message:

ignore_row_order_false

The rows aren't ordered by default because sorting slows down the function.

Ignore column order

This section explains how to compare DataFrames, ignoring the order of the columns.

Suppose you have the following df1:

+----+----+
|num1|num2|
+----+----+
|   1|   7|
|   2|   8|
|   3|   9|
+----+----+

Here are the contents of df2:

+----+----+
|num2|num1|
+----+----+
|   7|   1|
|   8|   2|
|   9|   3|
+----+----+

Here's how to compare the equality of df1 and df2, ignoring the column order:

assert_df_equality(df1, df2, ignore_column_order=True)

Here's the error message you'll see if you run assert_df_equality(df1, df2), without ignoring the column order.

ignore_column_order_false

Ignore nullability

Each column in a schema has three properties: a name, data type, and nullable property. The column can accept null values if nullable is set to true.

You'll sometimes want to ignore the nullable property when making DataFrame comparisons.

Suppose you have the following df1:

+-----+---+
| name|age|
+-----+---+
| juan|  7|
|bruna|  8|
+-----+---+

And this df2:

+-----+---+
| name|age|
+-----+---+
| juan|  7|
|bruna|  8|
+-----+---+

You might be surprised to find that in this example, df1 and df2 are not equal and will error out with this message:

nullable_off_error

Examine the code in this contrived example to better understand the error:

def ignore_nullable_property():
    s1 = StructType([
       StructField("name", StringType(), True),
       StructField("age", IntegerType(), True)])
    df1 = spark.createDataFrame([("juan", 7), ("bruna", 8)], s1)
    s2 = StructType([
       StructField("name", StringType(), True),
       StructField("age", IntegerType(), False)])
    df2 = spark.createDataFrame([("juan", 7), ("bruna", 8)], s2)
    assert_df_equality(df1, df2)

You can ignore the nullable property when assessing equality by adding a flag:

assert_df_equality(df1, df2, ignore_nullable=True)

Elements contained within an ArrayType() also have a nullable property, in addition to the nullable property of the column schema. These are also ignored when passing ignore_nullable=True.

Again, examine the following code to understand the error that ignore_nullable=True bypasses:

def ignore_nullable_property_array():
    s1 = StructType([
        StructField("name", StringType(), True),
        StructField("coords", ArrayType(DoubleType(), True), True),])
    df1 = spark.createDataFrame([("juan", [1.42, 3.5]), ("bruna", [2.76, 3.2])], s1)
    s2 = StructType([
        StructField("name", StringType(), True),
        StructField("coords", ArrayType(DoubleType(), False), True),])
    df2 = spark.createDataFrame([("juan", [1.42, 3.5]), ("bruna", [2.76, 3.2])], s2)
    assert_df_equality(df1, df2)

Allow NaN equality

Python has NaN (not a number) values and two NaN values are not considered equal by default. Create two NaN values, compare them, and confirm they're not considered equal by default.

nan1 = float('nan')
nan2 = float('nan')
nan1 == nan2 # False

Pandas, a popular DataFrame library, does consider NaN values to be equal by default.

This library requires you to set a flag to consider two NaN values to be equal.

assert_df_equality(df1, df2, allow_nan_equality=True)

Approximate column equality

We can check if columns are approximately equal, which is especially useful for floating number comparisons.

Here's a test that creates a DataFrame with two floating point columns and verifies that the columns are approximately equal. In this example, values are considered approximately equal if the difference is less than 0.1.

def test_approx_col_equality_same():
    data = [
        (1.1, 1.1),
        (2.2, 2.15),
        (3.3, 3.37),
        (None, None)
    ]
    df = spark.createDataFrame(data, ["num1", "num2"])
    assert_approx_column_equality(df, "num1", "num2", 0.1)

Here's an example of a test with columns that are not approximately equal.

def test_approx_col_equality_different():
    data = [
        (1.1, 1.1),
        (2.2, 2.15),
        (3.3, 5.0),
        (None, None)
    ]
    df = spark.createDataFrame(data, ["num1", "num2"])
    assert_approx_column_equality(df, "num1", "num2", 0.1)

This failing test will output a readable error message so the issue is easy to debug.

ColumnsNotEqualError

Approximate DataFrame equality

Let's create two DataFrames and confirm they're approximately equal.

def test_approx_df_equality_same():
    data1 = [
        (1.1, "a"),
        (2.2, "b"),
        (3.3, "c"),
        (None, None)
    ]
    df1 = spark.createDataFrame(data1, ["num", "letter"])

    data2 = [
        (1.05, "a"),
        (2.13, "b"),
        (3.3, "c"),
        (None, None)
    ]
    df2 = spark.createDataFrame(data2, ["num", "letter"])

    assert_approx_df_equality(df1, df2, 0.1)

The assert_approx_df_equality method is smart and will only perform approximate equality operations for floating point numbers in DataFrames. It'll perform regular equality for strings and other types.

Let's perform an approximate equality comparison for two DataFrames that are not equal.

def test_approx_df_equality_different():
    data1 = [
        (1.1, "a"),
        (2.2, "b"),
        (3.3, "c"),
        (None, None)
    ]
    df1 = spark.createDataFrame(data1, ["num", "letter"])

    data2 = [
        (1.1, "a"),
        (5.0, "b"),
        (3.3, "z"),
        (None, None)
    ]
    df2 = spark.createDataFrame(data2, ["num", "letter"])

    assert_approx_df_equality(df1, df2, 0.1)

Here's the pretty error message that's outputted:

DataFramesNotEqualError

Schema mismatch messages

DataFrame equality messages peform schema comparisons before analyzing the actual content of the DataFrames. DataFrames that don't have the same schemas should error out as fast as possible.

Let's compare a DataFrame that has a string column an integer column with a DataFrame that has two integer columns to observe the schema mismatch message.

def test_schema_mismatch_message():
    data1 = [
        (1, "a"),
        (2, "b"),
        (3, "c"),
        (None, None)
    ]
    df1 = spark.createDataFrame(data1, ["num", "letter"])

    data2 = [
        (1, 6),
        (2, 7),
        (3, 8),
        (None, None)
    ]
    df2 = spark.createDataFrame(data2, ["num", "num2"])

    assert_df_equality(df1, df2)

Here's the error message:

SchemasNotEqualError

Supported PySpark / Python versions

chispa currently supports PySpark 2.4+ and Python 3.5+.

Use chispa v0.8.2 if you're using an older Python version.

PySpark 2 support will be dropped when chispa 1.x is released.

Benchmarks

TODO: Need to benchmark these methods vs. the spark-testing-base ones

Vendored dependencies

These dependencies are vendored:

The dependencies are vendored to save you from dependency hell.

Developing chispa on your local machine

You are encouraged to clone and/or fork this repo.

This project uses Poetry for packaging and dependency management.

  • Setup the virtual environment with poetry install
  • Run the tests with poetry run pytest tests

Studying the codebase is a great way to learn about PySpark!

Contributing

Anyone is encouraged to submit a pull request, open an issue, or submit a bug report.

We're happy to promote folks to be library maintainers if they make good contributions.

More Repositories

1

spark-daria

Essential Spark extensions and helper methods ✨😲
Scala
746
star
2

quinn

pyspark methods to enhance developer productivity πŸ“£ πŸ‘― πŸŽ‰
Python
604
star
3

spark-fast-tests

Apache Spark testing helpers (dependency free & works with Scalatest, uTest, and MUnit)
Scala
421
star
4

mack

Delta Lake helper methods in PySpark
Python
294
star
5

spark-style-guide

Spark style guide
Jupyter Notebook
229
star
6

code_quizzer

Programming practice questions with Ruby, JavaScript, Rails, and Bash.
HTML
201
star
7

frontend-generators

Rake tasks to add Bootstrap, Font Awesome, and Start Bootstrap Landing Pages to a Rails app
CSS
96
star
8

spark-sbt.g8

A giter8 template for Spark SBT projects
Scala
73
star
9

spark-stringmetric

Spark functions to run popular phonetic and string matching algorithms
Scala
55
star
10

bebe

Filling in the Spark function gaps across APIs
Scala
50
star
11

jodie

Delta lake and filesystem helper methods
Scala
47
star
12

farsante

Fake Pandas / PySpark DataFrame creator
Rust
35
star
13

beavis

Pandas helper functions
Python
25
star
14

tic_tac_toe

Ruby tic tac toe game
Ruby
25
star
15

ceja

PySpark phonetic and string matching algorithms
Python
24
star
16

spark-test-example

Spark DataFrame transformation and UDF test examples
Scala
22
star
17

spark-spec

Test suite to document the behavior of Spark
Scala
21
star
18

gill

An example PySpark project with pytest
Python
18
star
19

directed_graph

Modeling directed acyclic graphs (DAG) for topological sorting, shortest path, longest path, etc.
Ruby
14
star
20

spark-slack

Speak Slack notifications and process Slack slash commands
Scala
13
star
21

scalatest-example

Testing Scala code with scalatest
Scala
11
star
22

python-parquet-examples

Using the Parquet file format with Python
Python
11
star
23

levi

Delta Lake helper methods. No Spark dependency.
Python
10
star
24

unicron

DAGs on DAGs! Smart PySpark custom transformation runner
Python
10
star
25

pysparktestingexample

PySpark testing example project
Python
9
star
26

JavaSpark

Example Spark project with Java API
Java
9
star
27

spark-pika

Demo how to set up Spark with SBT
Scala
7
star
28

spark-etl

Lightweight Spark ETL framework
Scala
6
star
29

slack_trello

Helping Slack and Trello play together nicely
Ruby
6
star
30

mill_spark_example

Apache Spark project with the Mill build tool
Scala
6
star
31

mrpowers-benchmarks

MrPowers benchmarks for Dask, Polars, DataFusion, and pandas
Jupyter Notebook
5
star
32

pydata-style-guide

Style for the PyData stack
5
star
33

walle

Compression algorithms for different file formats
Python
5
star
34

angelou

PySpark on Poetry example
Python
5
star
35

great-spark

Curated collection of Spark libraries and example applications
5
star
36

appa

Data lake metadata / transaction log store
Python
5
star
37

turf

Set application variables for the development, test, and production environments
Ruby
5
star
38

eren

PySpark Hive helper methods
Python
5
star
39

prawn_charts

Prawn gem to develop vector line charts
Ruby
5
star
40

spark-bulba

Tutorial on running faster tests with Spark
Scala
4
star
41

ml-book

Introduction to Machine Learning with Python Book
Jupyter Notebook
4
star
42

blake

Great Pandas and Jupyter workflow with Poetry
Jupyter Notebook
4
star
43

cmap

Model cmap exports as a directed graph and generate SQL
Ruby
4
star
44

redshift_extractor

Using the Redshift UNLOAD/COPY commands to move data from one Redshift cluster/database to another
Ruby
4
star
45

deltadask

Delta Lake powered by Dask
Jupyter Notebook
4
star
46

spark-frameless

Typed Datasets with Spark
Scala
4
star
47

spark-examples

A Spark playground to help me write blog posts
Scala
4
star
48

slack_notifier_wrapper

Making it easier to work with the slack_notifier gem
Ruby
3
star
49

scalate-example

Templates in Scala with Scalate
Scala
3
star
50

rails-startbootstrap-creative

Creative by Start Bootstrap - Rails Version
Ruby
3
star
51

repo_tools

Easily manage clone Git repos in Ruby applications
Ruby
3
star
52

munit-example

Simple example of the MUnit testing library
Scala
3
star
53

hll-example

Implementing HyperLogLog functions in Spark
Scala
3
star
54

pyspark-spec

Documents the behavior of pyspark
Python
3
star
55

mungingdata

Code to support MungingData blog posts: https://mungingdata.com/
Scala
3
star
56

dask-interop

Integration tests to demonstrate Dask's interoperability with other systems
Python
3
star
57

dask-fun

Dask examples with tests
Jupyter Notebook
3
star
58

vimtraining

Practicing Vim after completing the vimtutor
2
star
59

scala-design

Core Scala language features and design patterns
Scala
2
star
60

GameBoard

This is a GameBoard class with methods to help analyze the grid.
Ruby
2
star
61

technical_writing

Elements of style for blogs, books, and presentations
2
star
62

cali

Guide to provision a Mac for developers
Vim Script
2
star
63

data-scrapbook

A collection of images and captions to explain core data concepts
2
star
64

learn_spanish

Logically learn Spanish
Ruby
2
star
65

sapo

Data store validator for sqlite, Parquet
Python
2
star
66

project_euler

Some Project Euler solutions
Ruby
2
star
67

yellow-taxi

Data lake fun!
Scala
2
star
68

sqlite-example

Creating a sqlite db and writing it to files
Jupyter Notebook
2
star
69

http_validator

Ruby
2
star
70

polars-fun

Example notebooks for how to use pola.rs
Jupyter Notebook
2
star
71

mesita

Print colorful tables with nice diffs in the Terminal
Python
2
star
72

spark-utest

Example of how to use uTest with Spark
Scala
1
star
73

mrpowers-book

Book on MrPowers OSS projects, blogs, and other assets
1
star
74

doctor_scrabble

Rails Scrabble App
Ruby
1
star
75

dotfiles

My dotfiles
Shell
1
star
76

mini_yelp

Ruby
1
star
77

eli5_ruby_cs

explain like I'm 5: computer science with ruby
Ruby
1
star
78

mrpowers.github.io

Documentation and stuff
HTML
1
star
79

pyspark-examples

PySpark example notebooks
Jupyter Notebook
1
star
80

tic_tac_toe_js

A tic tac toe game, written in JS, with DOM crap isolated out of the way
JavaScript
1
star
81

go-example

Simple Go project
Go
1
star
82

custom_tableau

Using JavaScript to create Tableu-like dashboards
JavaScript
1
star
83

ansible_playbooks

Ansible playbooks
Ruby
1
star
84

javascript_book

Teaching JavaScript logically without being dorks
1
star
85

rails-startbootstrap-freelancer

Rails implementation of the Start Bootstrap Freelancer theme
CSS
1
star
86

express_practice

Some practice exercises for building Node and Express applications
JavaScript
1
star