tmalaska/SparkUnitTestingExamples

Stars
36
Rank 735,472 (Top 15 %)
Language
Scala
License
Apache License 2.0
Created over 8 years ago
Updated about 4 years ago

tmalaska

This project is a collection of Spark unit test examples that give new Spark users good models for unit testing their code for Spark Core, Spark SQL, and Spark Streaming.

SparkOnKudu

Based on the design of SparkOnHBase. This repo will support Spark, Spark Streaming, and Spark SQL integration with Kudu.

SparkStreaming.Sessionization

NRT Sessionization with Spark Streaming landing on HDFS and putting live stats in HBase

Spark.TableStatsExample

Simple Spark example of generating table stats for use in data quality checks
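
As a rough illustration of what per-column table stats for data quality checks can look like, here is a minimal sketch in plain Scala. It is not taken from the repository: the `TableStatsSketch` object, the `ColumnStats` fields, and the modelling of rows as sequences of optional string cells are all assumptions made for the example, and ordinary collections stand in for a Spark dataset.

```scala
object TableStatsSketch {
  // Hypothetical per-column summary; the choice of fields is an assumption.
  final case class ColumnStats(
      nulls: Long,
      nonNulls: Long,
      distinct: Long,
      min: Option[String],
      max: Option[String]
  )

  // Rows are modelled as fixed-width sequences of optional string cells,
  // where None represents a null value.
  def columnStats(rows: Seq[Seq[Option[String]]], numColumns: Int): Vector[ColumnStats] =
    Vector.tabulate(numColumns) { col =>
      val cells = rows.map(row => row(col))
      val present = cells.flatten // drop the nulls
      ColumnStats(
        nulls = cells.count(_.isEmpty).toLong,
        nonNulls = present.size.toLong,
        distinct = present.distinct.size.toLong,
        min = if (present.isEmpty) None else Some(present.min),
        max = if (present.isEmpty) None else Some(present.max)
      )
    }
}
```

A data quality check could then compare these numbers against expectations, for example flagging a column whose null count exceeds a chosen threshold.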

HBase-ToHDFS

Reads an HBase table and writes it out as Text, Seq, Avro, or Parquet

SparkOnHBase

SparkOnALog

Examples of Integrating Spark Streaming, Flume, and HBase to solve Streaming problems

CopybookInputFormat

Using JRecord to build a mapred and mapreduce inputformat for HDFS, MAPREDUCE, PIG, HIVE, Spark, ...

HBase.MCC

HBase.MCC (HBase Multi Cluster Client). The goal is to support always-up solutions with HBase through multiple clusters

Hive.Generate.DDL

Generation tool that generates DDLs and simple data load scripts.

hadcom.utils

Advanced common functionality for Hadoop

Taxi360

Simple example of HBase, Solr, and Kudu for Entity 360 using NY taxi data

teraSort-Compressed

The rules of TeraSort say you can't compress the input and output. Well, those rules are out of touch with real use cases on Hadoop.

CleanUpEmptyFilesTool

This tool is designed to look through your HDFS folders to either identify files with no data in them or delete files with no data in them.

FileIngestor

A simple program to put files from a directory into HDFS, with the added functionality of defining how that action will happen

Spark..Unique.Seq.Generator

This is an example of how to make Unique Sequences in a distributed way with Spark (No dups, No Skips)
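
One common way to get sequence numbers with no duplicates and no skips across partitions is a two-pass scheme: count the rows in each partition, turn the counts into starting offsets with a running total, then let each partition number its own rows from its offset. The sketch below shows that idea in plain Scala and is not taken from the repository; `UniqueSeqSketch`, `assignSequence`, and the nested `Vector` standing in for partitions are hypothetical names chosen for the example. In Spark itself, `RDD.zipWithIndex` provides comparable gap-free indexing.

```scala
object UniqueSeqSketch {
  // Assign a gap-free, duplicate-free sequence number to every row.
  // `partitions` is a hypothetical stand-in for the partitions of an RDD.
  def assignSequence[A](
      partitions: Vector[Vector[A]],
      start: Long = 0L
  ): Vector[Vector[(Long, A)]] = {
    // Pass 1: count the rows in each partition.
    val counts: Vector[Long] = partitions.map(_.size.toLong)
    // A running total of the counts gives each partition its starting offset.
    val offsets: Vector[Long] = counts.scanLeft(start)(_ + _).init
    // Pass 2: each partition numbers its rows independently from its offset.
    partitions.zip(offsets).map { case (rows, offset) =>
      rows.zipWithIndex.map { case (row, i) => (offset + i, row) }
    }
  }
}
```

Because every partition knows its offset before numbering starts, the second pass needs no coordination between partitions, and empty partitions simply consume no numbers.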

Spark.GraphX.Examples

Just some examples of using GraphX

Flume.NettyAvroAsyncRpcClient

This is a layer on top of the Flume NettyAvroRpcClient that allows for multiple connections to a server.

MRSmallFileCombiner

Tool to read many small files in HDFS with MR while allowing the caller to define the number of mappers.

Spark.ProdictBehaviorBasedOnPastActives

This is an example of how to do window analysis with Spark

HBase.GetTopNRecords

This is a simple example to show how a single HBase "get" can retrieve the top N {items,amount} in order of decreasing amount

HBaseMassiveBulkLoadUtils

This is a tool for testing and managing many repeated, large bulk loads on HBase

AppTrans

Examples for training

EdgeNodeGraphUi

Connecting the power of the D3 graphing library to CDH (HDFS, HBase and Impala)

HBase-FastTableCopy

This will contain implementations that copy records from a table with fewer regions than the final table.

spark.mergesort.example

An example of how to do a merge sort

FairSchedulerPlus

An upgraded, extended FairScheduler that takes sub-groups into account.

MapReduce.Unique.Seq.Generator

This is a single map reduce job that will append a unique sequence number to the front of every row in a source file.

FixedLengthInputFormat

This is a FixedLengthInputFormat for Hadoop map reduce.

SparkStreamingSeqSink

Support to write Seq Files with Spark Streaming with similar functionality as Flume HDFS Sink with Seq Files

IngestProcessStoreInNRT

This is a demo/training application, used to show how easy it is to do operations like ingestion, aggregation, and change data capture using tools like Kafka, Spark Streaming, Flume, Kudu, Solr, HBase, and HDFS.