  • Stars: 422
  • Rank: 100,281 (Top 3 %)
  • Language: Scala
  • License: MIT License
  • Created about 7 years ago
  • Updated about 2 months ago

Repository Details

Apache Spark testing helpers (dependency free & works with Scalatest, uTest, and MUnit)

spark-fast-tests

A fast Apache Spark testing helper library with beautifully formatted error messages! Works with ScalaTest, uTest, and MUnit.

Use chispa for PySpark applications.

Read Testing Spark Applications for a full explanation on the best way to test Spark code! Good test suites yield higher quality codebases that are easy to refactor.

Install

Fetch the JAR file from Maven.

// for Spark 3
libraryDependencies += "com.github.mrpowers" %% "spark-fast-tests" % "1.1.0" % "test"

// for Spark 2
libraryDependencies += "com.github.mrpowers" %% "spark-fast-tests" % "0.23.0" % "test"

Releases are published for different Scala versions.

Use Scala 2.11 with Spark 2 and Scala 2.12 or 2.13 with Spark 3.

Simple examples

The assertSmallDatasetEquality method can be used to compare two Datasets (or two DataFrames). The example below uses the DataFrame-specific variant, assertSmallDataFrameEquality.

val sourceDF = Seq(
  (1),
  (5)
).toDF("number")

val expectedDF = Seq(
  (1, "word"),
  (5, "word")
).toDF("number", "word")

assertSmallDataFrameEquality(sourceDF, expectedDF)
// throws a DatasetSchemaMismatch exception

The assertSmallDatasetEquality method can also be used to compare Datasets.

// Person is assumed to be a case class such as: case class Person(name: String, age: Int)
val sourceDS = Seq(
  Person("juan", 5),
  Person("bob", 1),
  Person("li", 49),
  Person("alice", 5)
).toDS

val expectedDS = Seq(
  Person("juan", 5),
  Person("frank", 10),
  Person("li", 49),
  Person("lucy", 5)
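).toDS

assertSmallDatasetEquality(sourceDS, expectedDS)
// throws a DatasetContentMismatch exception

[Screenshot: assertSmallDatasetEquality error message]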

The colors in the error message make it easy to identify the rows that aren't equal.

The DatasetComparer has assertSmallDatasetEquality and assertLargeDatasetEquality methods to compare either Datasets or DataFrames.

If you only need to compare DataFrames, you can use DataFrameComparer with the associated assertSmallDataFrameEquality and assertLargeDataFrameEquality methods. Under the hood, DataFrameComparer uses the assertSmallDatasetEquality and assertLargeDatasetEquality methods.

Note: comparing Datasets can be tricky since some column names might be assigned by Spark when applying transformations. Set the ignoreColumnNames boolean to true to skip column name verification.
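
As a rough illustration of the note above, the sketch below compares a DataFrame whose column name was generated by Spark with an expected DataFrame that uses a hand-picked name. The object name, the local SparkSession settings, and the column names are assumptions made for this example; only assertSmallDatasetEquality and the ignoreColumnNames flag come from the library.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import com.github.mrpowers.spark.fast.tests.DatasetComparer

// Hypothetical example object; in a real project this would live in a test suite.
object IgnoreColumnNamesExample extends DatasetComparer {

  // Local session for the sketch; the settings are assumptions.
  lazy val spark: SparkSession = SparkSession
    .builder()
    .master("local")
    .appName("ignore-column-names-example")
    .config("spark.sql.shuffle.partitions", "1")
    .getOrCreate()

  def run(): Unit = {
    import spark.implicits._

    // Spark generates the column name for this expression: "(number + 1)"
    val actualDF = Seq(1, 5).toDF("number").select(col("number") + 1)

    // The expected DataFrame uses a different, hand-picked column name
    val expectedDF = Seq(2, 6).toDF("number_plus_one")

    // Column names are not compared; types and row contents still are
    assertSmallDatasetEquality(actualDF, expectedDF, ignoreColumnNames = true)
  }
}
```

Calling IgnoreColumnNamesExample.run() should complete without throwing, since the two DataFrames differ only in their column names.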

Why is this library fast?

This library provides several methods to test your code, each with a different trade-off between speed and robustness.

Suppose you'd like to test this function:

def myLowerClean(col: Column): Column = {
  lower(regexp_replace(col, "\\s+", ""))
}

Here's how long the tests take to execute:

test method                     runtime
------------------------------  ----------------
assertLargeDataFrameEquality    709 milliseconds
assertSmallDataFrameEquality    166 milliseconds
assertColumnEquality            108 milliseconds
evalString                       26 milliseconds

evalString isn't as robust, but is the fastest. assertColumnEquality is robust and saves a lot of time.

Other testing libraries don't have methods like assertSmallDataFrameEquality or assertColumnEquality so they run slower.
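
For instance, a test of myLowerClean built on assertColumnEquality might look like the sketch below. The object name, the session settings, the sample rows, and the clean_name column are assumptions for illustration; myLowerClean is copied from above and assertColumnEquality comes from the ColumnComparer trait.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, lower, regexp_replace}
import com.github.mrpowers.spark.fast.tests.ColumnComparer

// Hypothetical example object; in a real project this would live in a test suite.
object MyLowerCleanExample extends ColumnComparer {

  // Local session for the sketch; the settings are assumptions.
  lazy val spark: SparkSession = SparkSession
    .builder()
    .master("local")
    .appName("my-lower-clean-example")
    .config("spark.sql.shuffle.partitions", "1")
    .getOrCreate()

  // Function under test, copied from the README
  def myLowerClean(col: Column): Column = {
    lower(regexp_replace(col, "\\s+", ""))
  }

  def run(): Unit = {
    import spark.implicits._

    // Each row pairs an input value with the value we expect after cleaning
    val df = Seq(
      ("  BOB  ", "bob"),
      ("JoSe li", "joseli"),
      (null, null)
    ).toDF("name", "expected_name")
      .withColumn("clean_name", myLowerClean(col("name")))

    // Compares the computed column with the expected column, row by row
    assertColumnEquality(df, "clean_name", "expected_name")
  }
}
```

Keeping the input and the expected output side by side in one DataFrame means only a single DataFrame has to be built per test.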

Usage

The spark-fast-tests project doesn't provide a SparkSession object in your test suite, so you'll need to make one yourself.

import org.apache.spark.sql.SparkSession

trait SparkSessionTestWrapper {

  lazy val spark: SparkSession = {
    SparkSession
      .builder()
      .master("local")
      .appName("spark session")
      .config("spark.sql.shuffle.partitions", "1")
      .getOrCreate()
  }

}

It's best to set the number of shuffle partitions to a small number like one or four in your test suite. This configuration can make your tests run up to 70% faster. You can remove this configuration option or adjust it if you're working with big DataFrames in your test suite.

Make sure to only use the SparkSessionTestWrapper trait in your test suite. You don't want to use test-specific configuration (like one shuffle partition) when running production code.

The DatasetComparer trait defines the assertSmallDatasetEquality method. Extend your spec file with the SparkSessionTestWrapper trait to create DataFrames and the DatasetComparer trait to make DataFrame comparisons.

import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
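import com.github.mrpowers.spark.fast.tests.DatasetComparer
import org.scalatest.FunSpec // ScalaTest 3.0.x; newer releases provide org.scalatest.funspec.AnyFunSpec instead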

class DatasetSpec extends FunSpec with SparkSessionTestWrapper with DatasetComparer {

  import spark.implicits._

  it("aliases a DataFrame") {

    val sourceDF = Seq(
      ("jose"),
      ("li"),
      ("luisa")
    ).toDF("name")

    val actualDF = sourceDF.select(col("name").alias("student"))

    val expectedDF = Seq(
      ("jose"),
      ("li"),
      ("luisa")
    ).toDF("student")

    assertSmallDatasetEquality(actualDF, expectedDF)

  }
}

To compare large DataFrames that are partitioned across different nodes in a cluster, use the assertLargeDatasetEquality method.

assertLargeDatasetEquality(actualDF, expectedDF)

assertSmallDatasetEquality is faster for test suites that run on your local machine. assertLargeDatasetEquality should only be used for DataFrames that are split across nodes in a cluster.

Column Equality

The assertColumnEquality method can be used to assess the equality of two columns in a DataFrame.

Suppose you have the following DataFrame with two columns that are not equal.

+-------+-------------+
|   name|expected_name|
+-------+-------------+
|   phil|         phil|
| rashid|       rashid|
|matthew|        mateo|
|   sami|         sami|
|     li|         feng|
|   null|         null|
+-------+-------------+

The following code throws a ColumnMismatch error:

assertColumnEquality(df, "name", "expected_name")

[Screenshot: assertColumnEquality error message]

Mix in the ColumnComparer trait to your test class to access the assertColumnEquality method:

import com.github.mrpowers.spark.fast.tests.ColumnComparer
import utest._

object MySpecialClassTest
    extends TestSuite
    with ColumnComparer
    with SparkSessionTestWrapper {

    // your tests
}

Unordered DataFrame equality comparisons

Suppose you have the following actualDF:

+------+
|number|
+------+
|     1|
|     5|
+------+

And suppose you have the following expectedDF:

+------+
|number|
+------+
|     5|
|     1|
+------+

The DataFrames have the same columns and rows, but the order is different.

assertSmallDataFrameEquality(actualDF, expectedDF) will throw a DatasetContentMismatch error.

We can set the orderedComparison boolean flag to false and spark-fast-tests will sort the DataFrames before performing the comparison.

assertSmallDataFrameEquality(actualDF, expectedDF, orderedComparison = false) will not throw an error.
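
Put together, the two calls above could be exercised as in the following sketch. The object name and the session settings are assumptions; the DataFrames mirror the two tables shown earlier in this section, and Try is used only so the failing ordered comparison does not abort the example.

```scala
import scala.util.Try
import org.apache.spark.sql.SparkSession
import com.github.mrpowers.spark.fast.tests.DataFrameComparer

// Hypothetical example object; in a real project this would live in a test suite.
object UnorderedComparisonExample extends DataFrameComparer {

  // Local session for the sketch; the settings are assumptions.
  lazy val spark: SparkSession = SparkSession
    .builder()
    .master("local")
    .appName("unordered-comparison-example")
    .config("spark.sql.shuffle.partitions", "1")
    .getOrCreate()

  def run(): Unit = {
    import spark.implicits._

    val actualDF = Seq(1, 5).toDF("number")
    val expectedDF = Seq(5, 1).toDF("number")

    // Same rows in a different order: the default ordered comparison fails
    val ordered = Try(assertSmallDataFrameEquality(actualDF, expectedDF))
    assert(ordered.isFailure)

    // With orderedComparison = false, row order is ignored and the check passes
    assertSmallDataFrameEquality(actualDF, expectedDF, orderedComparison = false)
  }
}
```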

Equality comparisons ignoring the nullable flag

You might also want to make equality comparisons that ignore the nullable flags for the DataFrame columns.

Here is how to use the ignoreNullable flag to compare DataFrames without considering the nullable property of each column.

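// createDF is provided by spark-daria (import com.github.mrpowers.spark.daria.sql.SparkSessionExt._)
val sourceDF = spark.createDF(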
  List(
    (1),
    (5)
  ), List(
    ("number", IntegerType, false)
  )
)

val expectedDF = spark.createDF(
  List(
    (1),
    (5)
  ), List(
    ("number", IntegerType, true)
  )
)

assertSmallDatasetEquality(sourceDF, expectedDF, ignoreNullable = true)

Approximate DataFrame Equality

The assertApproximateDataFrameEquality function is useful for DataFrames that contain DoubleType columns. The precision threshold must be set when using the assertApproximateDataFrameEquality function.

val sourceDF = spark.createDF(
  List(
    (1.2),
    (5.1),
    (null)
  ), List(
    ("number", DoubleType, true)
  )
)

val expectedDF = spark.createDF(
  List(
    (1.2),
    (5.1),
    (null)
  ), List(
    ("number", DoubleType, true)
  )
)

assertApproximateDataFrameEquality(sourceDF, expectedDF, 0.01)
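
The example above compares identical values, so the precision never comes into play. The sketch below uses values that differ slightly but stay within the threshold. The object name, the session settings, and the sample values are assumptions, and it also assumes that assertApproximateDataFrameEquality is available through the DataFrameComparer trait.

```scala
import org.apache.spark.sql.SparkSession
import com.github.mrpowers.spark.fast.tests.DataFrameComparer

// Hypothetical example object; in a real project this would live in a test suite.
object ApproximateEqualityExample extends DataFrameComparer {

  // Local session for the sketch; the settings are assumptions.
  lazy val spark: SparkSession = SparkSession
    .builder()
    .master("local")
    .appName("approximate-equality-example")
    .config("spark.sql.shuffle.partitions", "1")
    .getOrCreate()

  def run(): Unit = {
    import spark.implicits._

    // Values that differ from the expected ones by far less than 0.01
    val actualDF = Seq(1.2001, 5.1002).toDF("number")
    val expectedDF = Seq(1.2, 5.1).toDF("number")

    // Passes because every difference is within the 0.01 precision threshold
    assertApproximateDataFrameEquality(actualDF, expectedDF, 0.01)
  }
}
```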

Testing Tips

  • Use column functions instead of UDFs as described in this blog post
  • Try to organize your code as custom transformations so it's easy to test the logic elegantly
  • Don't write tests that read from files or write files. Dependency injection is a great way to avoid file I/O in your test suite.
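
To make the custom transformation tip more concrete, here is a minimal sketch of a transformation and a test for it. The withGreeting function, the object name, the session settings, and the sample data are all hypothetical; only assertSmallDataFrameEquality and its ignoreNullable flag come from the library.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit
import com.github.mrpowers.spark.fast.tests.DataFrameComparer

// Hypothetical example object; in a real project this would live in a test suite.
object CustomTransformationExample extends DataFrameComparer {

  // Local session for the sketch; the settings are assumptions.
  lazy val spark: SparkSession = SparkSession
    .builder()
    .master("local")
    .appName("custom-transformation-example")
    .config("spark.sql.shuffle.partitions", "1")
    .getOrCreate()

  // Hypothetical custom transformation: a function from DataFrame to DataFrame
  def withGreeting()(df: DataFrame): DataFrame = {
    df.withColumn("greeting", lit("hello"))
  }

  def run(): Unit = {
    import spark.implicits._

    // In-memory input, so the test needs no file I/O
    val sourceDF = Seq("jose", "li").toDF("name")

    val actualDF = sourceDF.transform(withGreeting())

    val expectedDF = Seq(
      ("jose", "hello"),
      ("li", "hello")
    ).toDF("name", "greeting")

    // lit("hello") yields a non-nullable column while toDF yields a nullable one,
    // so the nullable flag is ignored here
    assertSmallDataFrameEquality(actualDF, expectedDF, ignoreNullable = true)
  }
}
```

Because the transformation takes and returns a DataFrame, it can be tested with data built in memory and chained with other transformations through transform.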

Alternatives

The spark-testing-base project has more features (e.g. streaming support) and is compiled to support a variety of Scala and Spark versions.

You might want to use spark-fast-tests instead of spark-testing-base in these cases:

  • You want to use uTest or a testing framework other than scalatest
  • You want to run tests in parallel (you need to set parallelExecution in Test := false with spark-testing-base)
  • You don't want to include hive as a project dependency
  • You don't want to restart the SparkSession after each test file executes so the suite runs faster

Publishing

GPG & Sonatype need to be set up properly before running these commands. See the spark-daria README for more information.

Always run clean before any publishing command, including between the different publishing commands.

Publishing is a two-step process.

Generate Scala 2.11 JAR files:

  • Run sbt -Dspark.testVersion=2.4.8
  • At the sbt prompt (>), run ; + publishSigned; sonatypeBundleRelease to create the JAR files and release them to Maven.

Generate Scala 2.12 & Scala 2.13 JAR files:

  • Run sbt
  • At the sbt prompt (>), run ; + publishSigned; sonatypeBundleRelease

The publishSigned command is made available by the sbt-pgp plugin, and the sonatypeBundleRelease command by the sbt-sonatype plugin.

When the release command is run, you'll be prompted to enter your GPG passphrase.

The Sonatype credentials should be stored in the ~/.sbt/sonatype_credentials file in this format:

realm=Sonatype Nexus Repository Manager
host=oss.sonatype.org
user=$USERNAME
password=$PASSWORD
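
One common way to make sbt read that file is a credentials setting in the build definition. The line below is a sketch that assumes the file location described above; check it against the project's own build.sbt before relying on it.

```scala
// build.sbt (sketch): load the Sonatype credentials from ~/.sbt/sonatype_credentials
credentials += Credentials(Path.userHome / ".sbt" / "sonatype_credentials")
```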

Additional Goals

  • Use memory efficiently so Spark test runs don't crash
  • Provide readable error messages
  • Easy to use in conjunction with other test suites
  • Give the user control of the SparkSession

Contributing

Open an issue or send a pull request to contribute. Anyone who makes good contributions to the project will be promoted to project maintainer status.

uTest settings to display color output

Create a CustomFramework class with overrides that turn off the default uTest color settings.

package com.github.mrpowers.spark.fast.tests

class CustomFramework extends utest.runner.Framework {
  override def formatWrapWidth: Int = 300
  // turn off the default exception message color, so spark-fast-tests
  // can send messages with custom colors
  override def exceptionMsgColor = toggledColor(utest.ufansi.Attrs.Empty)
  override def exceptionPrefixColor = toggledColor(utest.ufansi.Attrs.Empty)
  override def exceptionMethodColor = toggledColor(utest.ufansi.Attrs.Empty)
  override def exceptionPunctuationColor = toggledColor(utest.ufansi.Attrs.Empty)
  override def exceptionLineNumberColor = toggledColor(utest.ufansi.Attrs.Empty)
}

Update the build.sbt file to use the CustomFramework class:

testFrameworks += new TestFramework("com.github.mrpowers.spark.fast.tests.CustomFramework")
