• Stars
    3
  • Rank 3,963,521 (Top 79 %)
  • Language
    Scala
  • License
    Apache License 2.0
  • Created over 8 years ago
  • Updated over 8 years ago

Repository Details

This tool is designed to look through your HDFS folders to either identify or delete files with no data in them.

More Repositories

1

SparkOnKudu

Based on the design of SparkOnHBase. This repo will support Spark, Spark Streaming, and Spark SQL integration with Kudu.
Scala
51
star
2

SparkStreaming.Sessionization

NRT Sessionization with Spark Streaming landing on HDFS and putting live stats in HBase
Scala
51
star
3

SparkUnitTestingExamples

This project is a collection of Spark unit test examples that show new Spark users how to unit test their code for Spark Core, Spark SQL, and Spark Streaming
Scala
36
star
4

Spark.TableStatsExample

Simple Spark example of generating table stats for use in data quality checks
Scala
28
star
5

HBase-ToHDFS

Reads an HBase table and writes it out as Text, Seq, Avro, or Parquet
Java
28
star
6

SparkOnHBase

Scala
24
star
7

SparkOnALog

Examples of Integrating Spark Streaming, Flume, and HBase to solve Streaming problems
Java
19
star
8

CopybookInputFormat

Using JRecord to build a mapred and mapreduce inputformat for HDFS, MAPREDUCE, PIG, HIVE, Spark, ...
Java
18
star
9

HBase.MCC

HBase.MCC (HBase Multi Cluster Client). The goal is to support always-up solutions with HBase through multiple clusters
Java
14
star
10

Hive.Generate.DDL

Generation tool that generates DDLs and simple data load scripts.
Java
10
star
11

hadcom.utils

Advanced common functionality for hadoop
Java
6
star
12

Taxi360

Simple Example of HBase, SolR, and Kudu for Entity 360 using NY taxi data
Scala
6
star
13

teraSort-Compressed

The rules of tera sort say you can't compress the input and output. Well, those rules are out of touch with how real use cases on Hadoop work.
Java
4
star
14

FileIngestor

A simple program to put files from a directory into HDFS, with the added functionality of defining how that action will happen
Java
3
star
15

Spark..Unique.Seq.Generator

This is an example of how to make Unique Sequences in a distributed way with Spark (No dups, No Skips)
Java
3
star
16

Spark.GraphX.Examples

Just some examples of using GraphX
Scala
3
star
17

Flume.NettyAvroAsyncRpcClient

This is a layer on top of the Flume NettyAvroRpcClient that allows for multiple connections to a server.
Java
2
star
18

MRSmallFileCombiner

Tool to read many small files in HDFS with MR while allowing the caller to control the number of mappers.
Java
2
star
19

Spark.ProdictBehaviorBasedOnPastActives

This is an example of how to do window analysis with Spark
Scala
2
star
20

HBase.GetTopNRecords

This is a simple example to show how a single HBase "get" can retrieve the top N {items,amount} in order of decreasing amount
Java
2
star
21

HBaseMassiveBulkLoadUtils

This is a tool for testing and managing many repeated, large bulk loads on HBase
Java
2
star
22

AppTrans

Examples for training
Scala
1
star
23

EdgeNodeGraphUi

Connecting the power of the D3 graphing library to CDH (HDFS, HBase and Impala)
Java
1
star
24

HBase-FastTableCopy

This will contain implementations that will copy records from a table with fewer regions than the final table.
Java
1
star
25

spark.mergesort.example

An example of how to do a merge sort
Scala
1
star
26

FairSchedulerPlus

An upgraded, extended FairScheduler that takes sub-groups into account.
Java
1
star
27

MapReduce.Unique.Seq.Generator

This is a single map reduce job that will append a unique sequence number to the front of every row in a source file.
Java
1
star
28

FixedLengthInputFormat

This is a FixedLengthInputFormat for Hadoop map reduce.
Java
1
star
29

SparkStreamingSeqSink

Support to write Seq Files with Spark Streaming with similar functionality as Flume HDFS Sink with Seq Files
Scala
1
star
30

IngestProcessStoreInNRT

This is a demo/training application used to show how easy it is to do operations like ingestion, aggregation, and change data capture using tools like Kafka, Spark Streaming, Flume, Kudu, SolR, HBase, and HDFS
Scala
1
star