• Stars: 1,492
• Rank: 30,268 (Top 0.7%)
• Language: Scala
• License: Apache License 2.0
• Created: about 9 years ago
• Updated: 22 days ago



spark-testing-base

Base classes to use when writing tests with Spark.

Why?

You've written an awesome program in Spark and now it's time to write some tests. Only you find yourself writing the code to set up and tear down local-mode Spark between each suite, and you say to yourself: This is not my beautiful code.

How?

So you include com.holdenkarau.spark-testing-base [spark_version]_1.4.0, extend one of the classes, and write some simple tests instead. For example, to include this in a project using Spark 3.0.0:

"com.holdenkarau" %% "spark-testing-base" % "3.0.0_1.4.0" % "test"

or

<dependency>
	<groupId>com.holdenkarau</groupId>
	<artifactId>spark-testing-base_2.12</artifactId>
	<version>${spark.version}_1.4.0</version>
	<scope>test</scope>
</dependency>

How to use it inside your code? Have a look at the wiki page.
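As a minimal sketch of what such a test can look like (assuming ScalaTest and the `SharedSparkContext` trait; the suite name and data here are invented for illustration):

```scala
import com.holdenkarau.spark.testing.SharedSparkContext
import org.scalatest.funsuite.AnyFunSuite

// SharedSparkContext starts a local-mode SparkContext once for the suite
// and exposes it as `sc`, so no per-test setup/teardown code is needed.
class WordCountSuite extends AnyFunSuite with SharedSparkContext {
  test("word count on a small RDD") {
    val counts = sc.parallelize(Seq("a", "b", "a"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .collectAsMap()
    assert(counts("a") === 2)
    assert(counts("b") === 1)
  }
}
```

The trait handles the context lifecycle; each test body only contains the logic under test.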

The Maven repositories page for spark-testing-base lists the releases available.

The Python package of spark-testing-base is available via:

Minimum Memory Requirements and OOMs

The default sbt testing Java options are too small to support running many of the tests due to the need to launch Spark in local mode. To increase the amount of memory, in your build.sbt file you can add:

fork in Test := true
javaOptions ++= Seq("-Xms8G", "-Xmx8G", "-XX:MaxPermSize=4048M", "-XX:+CMSClassUnloadingEnabled")

Note: if you're running on JDK 17+, MaxPermSize and CMSClassUnloadingEnabled have been removed, so it becomes:

fork in Test := true
javaOptions ++= Seq("-Xms8G", "-Xmx8G")
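On sbt 1.x the `in` syntax above is deprecated in favor of the slash syntax; the equivalent settings (same values as above) would be:

```scala
Test / fork := true
Test / javaOptions ++= Seq("-Xms8G", "-Xmx8G")
```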

If using surefire you can add:

<argLine>-Xmx2048m -XX:MaxPermSize=2048m</argLine>

Note: the specific memory values are examples only (and the values used to run spark-testing-base's own tests).

Special considerations

Make sure to disable parallel execution.

In sbt you can add:

parallelExecution in Test := false

In surefire, make sure that forkCount is set to 1 and reuseForks is true.
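As a sketch, a surefire configuration along those lines might look like the following (trim or extend to match your existing plugin block):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <!-- A single forked JVM, reused across suites, so only one
         local-mode Spark is ever running at a time. -->
    <forkCount>1</forkCount>
    <reuseForks>true</reuseForks>
  </configuration>
</plugin>
```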

If you're testing Spark SQL codegen, make sure to set SPARK_TESTING=true

Codegen tests and Running Spark Testing Base's own tests

If you are testing codegen it's important to have SPARK_TESTING set to yes, as we do in our GitHub Actions.

SPARK_TESTING=yes ./build/sbt clean +compile +test -DsparkVersion=$SPARK_VERSION

Where is this from?

Some of this code is a stripped down version of the test suite bases that are in Apache Spark but are not accessible. Other parts are also inspired by sscheck (scalacheck generators for Spark).

Other parts of this are implemented on top of the test suite bases to make your life even easier.
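For example, the DataFrame helpers layer equality assertions on top of the shared-session setup. A sketch assuming ScalaTest and the `DataFrameSuiteBase` trait (the suite name and data are invented for illustration):

```scala
import com.holdenkarau.spark.testing.DataFrameSuiteBase
import org.scalatest.funsuite.AnyFunSuite

// DataFrameSuiteBase manages a shared SparkSession (available as `spark`)
// and provides DataFrame equality assertions.
class DataFrameEqualitySuite extends AnyFunSuite with DataFrameSuiteBase {
  test("identical DataFrames compare equal") {
    import spark.implicits._
    val expected = Seq((1, "a"), (2, "b")).toDF("id", "value")
    val result   = Seq((1, "a"), (2, "b")).toDF("id", "value")
    // assertDataFrameEquals comes from DataFrameSuiteBase
    assertDataFrameEquals(expected, result)
  }
}
```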

How do I build this?

This project is built with sbt.

What are some other options?

While we hope you choose our library, alternatives exist, including sscheck (https://github.com/juanrh/sscheck), spark-tests (https://github.com/hammerlab/spark-tests), DummyRDD (https://github.com/wdm0006/DummyRDD), and more (https://www.google.com/search?q=python+spark+testing+libraries).

Release Notes

Security Disclosure e-mails

Have you found a security concern? Please let us know.

See https://github.com/holdenk/spark-testing-base/blob/main/SECURITY.md

More Repositories

1. learning-spark-examples: Examples for learning spark (Java, 334 stars)
2. elasticsearchspark: Elastic Search on Spark (Scala, 112 stars)
3. spark-validator: A library you can include in your Spark job to validate the counters and perform operations on success. Goal is scala/java/python support. (Scala, 103 stars)
4. spark-structured-streaming-ml: Structured Streaming Machine Learning example with Spark 2.0 (Scala, 89 stars)
5. sparkProjectTemplate.g8: Template for Spark Projects (Scala, 88 stars)
6. spark-flowchart: Flowchart for debugging Spark applications (Shell, 83 stars)
7. fastdataprocessingwithsparkexamples: Examples for Fast Data Processing with Spark (Scala, 59 stars)
8. chef-cookbook-spark: A chef cookbook for deploying spark (Ruby, 30 stars)
9. spark-upgrade: Magic to help Spark pipelines upgrade (Python, 28 stars)
10. spark-intro-ml-pipeline-workshop: A simple introduction to using spark ml pipelines (Jupyter Notebook, 25 stars)
11. fastdataprocessingwithspark-sharkexamples: Examples for Fast Data Processing with Spark example Shark project (Scala, 22 stars)
12. holdensmagicalunicorn (Perl, 18 stars)
13. diversity-analytics: Analytics on Apache Projects for Diversity (Jupyter Notebook, 18 stars)
14. intro-to-pyspark-demos: Examples from Holden's intro to PySpark workshop. This is an intro level workshop focused on using Spark with Python. (14 stars)
15. clothes-from-code: Auto generate cool code based clothing [WIP] (Python, 12 stars)
16. remote-python-debugging-4-spark: Set up PDB on Spark (Jupyter Notebook, 10 stars)
17. livestreaming-tools: Basic tools for livestreaming, very much to Holden's use case. (Python, 7 stars)
18. distributedcomputing4kids (Jupyter Notebook, 6 stars)
19. kafka-streams-python-cthulhu: Proof of concept integration of Python into Kafka Streams. Built w/Scala (Python, 5 stars)
20. stalin-hax: hax on top of stalin (C, 4 stars)
21. spark-misc-utils: Misc Utils for Spark (Scala, 4 stars)
22. wanderinghobos (Scheme, 4 stars)
23. resume: latex resume (TeX, 4 stars)
24. spark-ml-example: Some examples using Spark's machine learning library. (Scala, 3 stars)
25. web2.0collage (Scheme, 3 stars)
26. github-rename-all-my-commits: Uses git filter-repo to rename all of your commits in all of your repos, intended for removing deadnames; will be funky with any forks you want to merge though. (Shell, 3 stars)
27. print-the-world: I (attempt to) print everything* from places (Python, 2 stars)
28. beam-test-examples: [WIP] Examples for testing Apache BEAM (Java, 2 stars)
29. fnurbot (Scala, 2 stars)
30. sparklingpinkpandas: Website for Sparkling Pink Pandas (queer, trans focused scooter club) (JavaScript, 2 stars)
31. colo-scripts (Shell, 2 stars)
32. mydotfiles: My dotfiles. You probably don't care about this. (Shell, 2 stars)
33. dnsrbl: A simple haskell interface to asynchronously lookup ip/name against a bunch of DNS based RBLs (2 stars)
34. Costume-Code: Code for the Alice in Wonderland Costume (Java, 1 star)
35. commerce: rails based e-commerce platform (Ruby, 1 star)
36. talk-info: Info of my talks (1 star)
37. not-so-deep-spark: A not so deep version of deep-spark (Jupyter Notebook, 1 star)
38. datasciencecoursera (1 star)