holdenk/spark-validator

Language
Scala
License
Apache License 2.0
Created over 10 years ago
Updated almost 7 years ago


A library you can include in your Spark job to validate the counters and perform operations on success. The goal is Scala, Java, and Python support.

Spark Validator

A library you can include in your Spark job to validate the counters and perform operations on success.

This software should be considered pre-alpha.

Why you should validate counters

Maybe you are really lucky and you never have intermittent outages or bugs in your code.

If you have accumulators for things like records processed or number of errors, it's really easy to write bounds for them. Even if you don't have custom counters, you can use Spark's built-in metrics (bytes read, time, etc.), and by looking at historical values you can establish reasonable bounds. This can help catch jobs that fail to process some of your records. This is not a replacement for unit or integration testing.
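As a rough illustration of establishing bounds from historical values, here is a minimal sketch in plain Scala. It does not use spark-validator; `HistoricBounds`, `fromHistory`, `withinHistory`, and the `slack` parameter are all hypothetical names chosen for this example, and it assumes counter values are non-negative.

```scala
// Hypothetical helper, not part of spark-validator: derive bounds for a
// counter from the values it took in earlier runs.
// Assumes counter values are non-negative.
object HistoricBounds {
  /** Returns (lower, upper) bounds, widening the observed range by `slack`
    * (0.1 means 10%). Returns None when there is no history yet. */
  def fromHistory(history: Seq[Long], slack: Double): Option[(Double, Double)] =
    if (history.isEmpty) None
    else Some((history.min * (1.0 - slack), history.max * (1.0 + slack)))

  /** True when `value` falls inside the bounds derived from `history`.
    * With no history there is nothing to compare against, so we accept. */
  def withinHistory(value: Long, history: Seq[Long], slack: Double): Boolean =
    fromHistory(history, slack).forall { case (lo, hi) => value >= lo && value <= hi }
}
```

For example, with earlier runs of 900 and 1100 records and a slack of 0.1, a run that reads 1000 records is accepted and a run that reads 10 is flagged.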

How spark validation works

We store all of the metrics from each run along with all of the accumulators you pass in.

If a run is successful, it runs your on-success handler. If you just want to mark the run as a success, you can specify a file for Spark Validator to touch.
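The on-success behaviour described above might look roughly like the sketch below. This is not the library's implementation: `SuccessMarker`, `touch`, and `finish` are hypothetical names, and the marker is assumed to live on the local filesystem (a real job may well write to HDFS or another store instead).

```scala
import java.nio.file.{Files, Path}
import java.nio.file.attribute.FileTime

// Hypothetical sketch, not the spark-validator API.
object SuccessMarker {
  /** Creates `marker` if it is missing, otherwise bumps its modification
    * time, similar to the Unix `touch` command. */
  def touch(marker: Path): Unit = {
    if (Files.exists(marker))
      Files.setLastModifiedTime(marker, FileTime.fromMillis(System.currentTimeMillis()))
    else
      Files.createFile(marker)
    ()
  }

  /** Runs `onSuccess` only when validation passed; returns the flag unchanged. */
  def finish(passed: Boolean)(onSuccess: () => Unit): Boolean = {
    if (passed) onSuccess()
    passed
  }
}
```

A caller could pass `() => SuccessMarker.touch(path)` as the handler, so downstream jobs only see the marker when validation passed.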

How to write your validation rules

Absolute
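An absolute rule bounds a counter by fixed values. The sketch below is a stand-in written for illustration, not the library's rule class: `AbsoluteBound` and `holds` are hypothetical names, and treating a missing counter as a failure is an assumption of this example.

```scala
// Hypothetical stand-in for illustration; the library's own rule class may differ.
final case class AbsoluteBound(counter: String, min: Option[Long], max: Option[Long]) {
  /** True when the counter is present and inside every bound that is set.
    * A bound of None means "no limit on that side". */
  def holds(counters: Map[String, Long]): Boolean =
    counters.get(counter).exists { v =>
      min.forall(v >= _) && max.forall(v <= _)
    }
}
```

With `AbsoluteBound("recordsRead", min = Some(1000L), max = None)`, a run that read 2500 records passes, while a run that read 10 records, or that never reported the counter, does not.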

Relative
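A relative rule compares a counter with its value from an earlier run rather than with fixed limits. The following is a minimal sketch under stated assumptions, not the library's API: `RelativeBound`, `holds`, and `maxChange` are hypothetical names, the comparison is against a single previous run, and a run with no earlier value is accepted because there is nothing to compare against.

```scala
// Hypothetical stand-in for illustration; the library's own rule class may differ.
final case class RelativeBound(counter: String, maxChange: Double) {
  /** True when the counter moved by at most `maxChange` (0.2 means 20%)
    * relative to the previous run. */
  def holds(previous: Map[String, Long], current: Map[String, Long]): Boolean =
    (previous.get(counter), current.get(counter)) match {
      case (Some(prev), Some(cur)) =>
        // A previous value of zero has no meaningful ratio, so require no change.
        if (prev == 0L) cur == 0L
        else math.abs(cur - prev).toDouble / math.abs(prev).toDouble <= maxChange
      case (None, Some(_)) => true // first run: nothing to compare against
      case _               => false // counter missing from the current run
    }
}
```

With `maxChange = 0.2`, going from 1000 to 1100 records passes (a 10% change), while dropping from 1000 to 500 fails (a 50% change).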

How to build

sbt - Remember when it was called the simple build tool?

sbt/sbt compile

How to use

Scala

At the start of your Spark program, once you have constructed your SparkContext, call:

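// Bring the validator classes (AbsoluteValueRule, ValidationConf, Validation) into scope
import com.holdenkarau.spark_validator._
...
// Each rule bounds one counter; here recordsRead must be at least 1000, with no upper limit
val rules = List(
    new AbsoluteValueRule(counter = "recordsRead", min=Some(1000), max=None),
    ...)
val vc = new ValidationConf(counterPath, jobName, firstTime, rules)
val validator = new Validation(vc)
...
// Once the job has run, check the collected counters against the rules
validator.validate()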

Java

vNext

Python

vNext+1

License
