• Stars: 106
• Rank: 325,871 (Top 7 %)
• Language: Scala
• License: Apache License 2.0
• Created over 10 years ago
• Updated almost 7 years ago


Repository Details

A library you can include in your Spark job to validate the counters and perform operations on success. The goal is Scala, Java, and Python support.


Spark Validator

A library you can include in your Spark job to validate the counters and perform operations on success.

This software should be considered pre-alpha.

Why you should validate counters

Maybe you are really lucky and you never have intermittent outages or bugs in your code.

If you have accumulators for things like records processed or number of errors, it's really easy to write bounds for these. Even if you don't have custom counters, you can use Spark's built-in metrics (bytes read, time, etc.) and establish reasonable bounds by looking at historic values. This can help catch jobs which fail to process some of your records. This is not a replacement for unit or integration testing.
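To make the idea of writing bounds for a counter concrete, here is a minimal sketch in plain Scala. It is not Spark Validator's implementation: `BoundsSketch`, `withinBounds`, and `violations` are hypothetical names, and counters are assumed to be available as a simple `Map` of name to value once the job has finished.

```scala
// Hypothetical helper for illustration only; not part of Spark Validator.
object BoundsSketch {
  /** True when `value` lies within the optional inclusive bounds. */
  def withinBounds(value: Long, min: Option[Long], max: Option[Long]): Boolean =
    min.forall(value >= _) && max.forall(value <= _)

  /** Names of bounded counters that are missing or outside their bounds. */
  def violations(
      counters: Map[String, Long],
      bounds: Map[String, (Option[Long], Option[Long])]): Seq[String] =
    bounds.toSeq.collect {
      case (name, (min, max))
          if !counters.get(name).exists(withinBounds(_, min, max)) => name
    }.sorted
}
```

A missing counter is treated as a violation here, on the assumption that a job which never incremented an expected counter probably did not do its work.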

How spark validation works

We store all of the metrics from each run along with all of the accumulators you pass in.

If a run is successful, Spark Validator runs your on-success handler. If you just want to mark the run as a success, you can specify a file for Spark Validator to touch.
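The success step could look roughly like the sketch below. Everything in it is an assumption made for illustration: `SuccessMarkerSketch`, `touch`, and `finish` are hypothetical names, and the marker is written to the local filesystem with `java.nio`, whereas a real Spark job would more likely write to a distributed filesystem.

```scala
import java.nio.file.{Files, Path, StandardOpenOption}

// Hypothetical sketch; not Spark Validator's actual success handling.
object SuccessMarkerSketch {
  /** Creates `marker` if it is missing; existing content is left untouched. */
  def touch(marker: Path): Unit = {
    Files.write(marker, Array.emptyByteArray,
      StandardOpenOption.CREATE, StandardOpenOption.APPEND)
    ()
  }

  /** Runs the handler and touches the marker only when validation passed. */
  def finish(passed: Boolean, marker: Path)(onSuccess: () => Unit): Boolean = {
    if (passed) {
      onSuccess()
      touch(marker)
    }
    passed
  }
}
```

Downstream jobs can then wait for the marker file to appear instead of inspecting the counters themselves.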

How to write your validation rules

Absolute

Relative
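The README leaves this section empty. As a rough illustration of what a rule relative to historic values might check, the sketch below compares the current counter value with the mean of earlier runs. The name `RelativeSketch`, the use of the mean, and the `tolerance` parameter are all assumptions, not the library's API.

```scala
// Hypothetical sketch of a relative check; not Spark Validator's rule class.
object RelativeSketch {
  /** True when `current` is within `tolerance` (a fraction such as 0.2 for
    * 20%) of the mean of `history`. With no history there is nothing to
    * compare against, so the check passes.
    */
  def withinRelative(current: Long, history: Seq[Long], tolerance: Double): Boolean =
    if (history.isEmpty) true
    else {
      val mean = history.sum.toDouble / history.size
      math.abs(current - mean) <= tolerance * mean
    }
}
```

Passing on an empty history is one possible way to handle a first run; a stricter design could fall back to absolute bounds instead.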

How to build

sbt - Remember when it was called the simple build tool?

sbt/sbt compile

How to use

Scala

At the start of your Spark program, once you have constructed your Spark context, call:

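import com.holdenkarau.spark_validator._
...
// Each rule bounds one counter; here recordsRead must be at least 1000.
val rules = List(
    new AbsoluteValueRule(counter = "recordsRead", min=Some(1000), max=None),
    ...)
val vc = new ValidationConf(counterPath, jobName, firstTime, rules)
val vl = new Validation(vc)
...
// Once the job's work is done, check the counters against the rules.
vl.validate()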

Java

vNext

Python

vNext+1

License
