  • Stars: 373
  • Rank: 114,600 (Top 3%)
  • Language: Java
  • Created: almost 7 years ago
  • Updated: 10 months ago

Repository Details

hive-testbench

A testbench for experimenting with Apache Hive at any data scale.

Overview

The hive-testbench is a data generator and a set of queries that let you experiment with Apache Hive at scale. The testbench lets you measure baseline Hive performance on large datasets and provides an easy way to see the impact of Hive tuning parameters and advanced settings.

Prerequisites

You will need:

  • Hadoop 2.2 or later cluster or Sandbox.
  • Apache Hive.
  • Between 15 minutes and 2 days to generate data (depending on the Scale Factor you choose and available hardware).
  • If you plan to generate 1TB or more of data, using Apache Hive 13+ to generate the data is STRONGLY suggested.
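A quick way to confirm the prerequisites are in place is to check that the relevant commands exist on the PATH (a minimal sketch; it does not verify versions or cluster health):

```shell
# Verify the tools the testbench relies on are available.
for tool in hadoop hive gcc; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: MISSING" >&2
  fi
done
```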

Install and Setup

All of these steps should be carried out on your Hadoop cluster.

  • Step 1: Prepare your environment.

    In addition to Hadoop and Hive, ensure gcc is installed and available on your system path before you begin. If your system does not have it, install it using yum or apt-get.
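For example, to install gcc only when it is missing (a sketch; pick the package-manager line that matches your distribution, as the text only says "yum or apt-get"):

```shell
# Install gcc if it is not already present.
if ! command -v gcc >/dev/null 2>&1; then
  sudo yum install -y gcc        # RHEL/CentOS
  # sudo apt-get install -y gcc  # Debian/Ubuntu alternative
fi
```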

  • Step 2: Decide which test suite(s) you want to use.

    hive-testbench comes with data generators and sample queries based on both the TPC-DS and TPC-H benchmarks. You can choose either or both of these benchmarks for experimentation. More information about them can be found at the Transaction Processing Performance Council (TPC) homepage.

  • Step 3: Compile and package the appropriate data generator.

    For TPC-DS, ./tpcds-build.sh downloads, compiles and packages the TPC-DS data generator. For TPC-H, ./tpch-build.sh downloads, compiles and packages the TPC-H data generator.

  • Step 4: Decide how much data you want to generate.

    You need to decide on a "Scale Factor", which represents how much data you will generate. Scale Factor roughly translates to gigabytes: a Scale Factor of 100 is about 100 gigabytes, and one terabyte is Scale Factor 1000. Decide how much data you want and keep it in mind for the next step. If you have a cluster of 4-10 nodes, or just want to experiment at a smaller scale, Scale Factor 1000 (1 TB) is a good starting point. If you have a large cluster, you may want to choose Scale Factor 10000 (10 TB) or more. The notion of scale factor is similar between TPC-DS and TPC-H.

    If you want to generate a large amount of data, you should use Hive 13 or later. Hive 13 introduced an optimization that allows far more scalable data partitioning. Hive 12 and lower will likely crash if you generate more than a few hundred GB of data and tuning around the problem is difficult. You can generate text or RCFile data in Hive 13 and use it in multiple versions of Hive.
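Since, per the paragraph above, the Scale Factor is roughly the dataset size in gigabytes, picking one is plain arithmetic (a sketch; no cluster needed):

```shell
# Scale Factor roughly equals the dataset size in GB (1 TB -> Scale Factor 1000).
target_tb=10
scale_factor=$((target_tb * 1000))
echo "Scale Factor for ${target_tb} TB: ${scale_factor}"
```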

  • Step 5: Generate and load the data.

    The scripts tpcds-setup.sh and tpch-setup.sh generate and load data for TPC-DS and TPC-H, respectively. General usage is tpcds-setup.sh scale_factor [directory] or tpch-setup.sh scale_factor [directory].

    Some examples:

     # Build 1 TB of TPC-DS data:
     ./tpcds-setup.sh 1000

     # Build 1 TB of TPC-H data:
     ./tpch-setup.sh 1000

     # Build 100 TB of TPC-DS data:
     ./tpcds-setup.sh 100000

     # Build 30 TB of text-formatted TPC-DS data:
     FORMAT=textfile ./tpcds-setup.sh 30000

     # Build 30 TB of RCFile-formatted TPC-DS data:
     FORMAT=rcfile ./tpcds-setup.sh 30000

    Also check the other parameters in the setup scripts; an important one is BUCKET_DATA.
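One quick way to see what is tunable is to list the uppercase variables defined in the setup script (a sketch; the exact variable set is an assumption, so check the script itself):

```shell
# List configurable shell variables (e.g. FORMAT, BUCKET_DATA) in the setup script.
grep -E '^[A-Z_]+=' tpcds-setup.sh
```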

  • Step 6: Run queries.

    More than 50 sample TPC-DS queries and all TPC-H queries are included for you to try. You can use hive, beeline or the SQL tool of your choice. The testbench also includes a set of suggested settings.

    This example assumes you have generated 1 TB of TPC-DS data during Step 5:

     cd sample-queries-tpcds
     hive -i testbench.settings
     hive> use tpcds_bin_partitioned_orc_1000;
     hive> source query55.sql;
    

    Note that the database is named after the benchmark and the Scale Factor chosen in Step 4. At Scale Factor 1000, the TPC-DS database is named tpcds_bin_partitioned_orc_1000 and the TPC-H database tpch_flat_orc_1000; at Scale Factor 10000 the suffix is _10000. You can always run show databases to get a list of available databases.

    Similarly, if you generated 1 TB of TPC-H data during Step 5:

     cd sample-queries-tpch
     hive -i testbench.settings
     hive> use tpch_flat_orc_1000;
     hive> source tpch_query1.sql;
    
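To run a whole suite non-interactively, you can loop over the .sql files from the shell (a sketch; the directory, settings file, and database name are taken from the 1 TB TPC-DS example above):

```shell
# Run each TPC-DS sample query against the 1 TB database, logging output per query.
cd sample-queries-tpcds
for q in query*.sql; do
  echo "Running $q"
  hive -i testbench.settings --database tpcds_bin_partitioned_orc_1000 -f "$q" \
      > "${q%.sql}.log" 2>&1
done
```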

Feedback

If you have questions, comments or problems, visit the Hortonworks Hive forum.

If you have improvements, pull requests are accepted.

More Repositories

  1. cloudbreak (Java, 353 stars): CDP Public Cloud is an integrated analytics and data management platform deployed on cloud services. It offers broad data analytics and artificial intelligence functionality along with secure user access and data governance features.
  2. gohadoop (Go, 309 stars)
  3. data-tutorials (Shell, 283 stars): Hortonworks tutorials
  4. ansible-hortonworks (Python, 248 stars): Ansible playbooks for deploying Hortonworks Data Platform and DataFlow using Ambari Blueprints
  5. simple-yarn-app (Java, 167 stars): Simple YARN application
  6. streamline (Java, 164 stars): StreamLine - Streaming Analytics
  7. kubernetes-yarn (Go, 117 stars)
  8. ambari-shell (Java, 89 stars): CLI for Apache Ambari
  9. structor (HTML, 67 stars): Vagrant files creating multi-node virtual Hadoop clusters with or without security.
  10. HDP-Public-Utilities (Shell, 65 stars)
  11. hoya (Java, 64 stars): Deploys and manages applications within a YARN cluster
  12. hadoop-icons (58 stars)
  13. hive-json (Java, 42 stars): A rough prototype of a tool for discovering Apache Hive schemas from JSON documents.
  14. cloudbreak-deployer (Shell, 35 stars): Cloudbreak Deployer Tool
  15. hortonworks-sandbox (Python, 34 stars)
  16. spark-native-yarn (Scala, 32 stars): Tez port for Spark API
  17. docker-e2e-protractor (Shell, 24 stars): This project is going to be retired soon, please use the successor at https://github.com/hortonworks/docker-e2e-cloud
  18. cloud-haunter (Go, 22 stars): Cloud agnostic resource monitoring and janitor tool
  19. docker-logrotate (Shell, 22 stars): Logrotation for docker containers
  20. mini-dev-cluster (Java, 20 stars): Mini YARN/DFS cluster for developing and testing YARN-based applications (e.g., Tez)
  21. ambari-rest-client (Groovy, 20 stars): Groovy client library for Apache Ambari's REST API
  22. docker-socat (Shell, 20 stars)
  23. docker-cloudbreak-uaa (Dockerfile, 19 stars): Docker container to run a UAA identity server
  24. dstream (Java, 18 stars)
  25. data_analytics_studio (16 stars)
  26. cloudbreak-images (Shell, 14 stars): Saltstack scripts to bake amazon/gcc/azure/openstack images suitable for Cloudbreak
  27. registry (Java, 13 stars): Schema Registry
  28. cb-cli (Go, 13 stars)
  29. docker-cloudbreak (Shell, 12 stars): Docker image for Cloudbreak
  30. cloudbreak-openstack-ansible (Shell, 12 stars): Setting up a production ready OpenStack installation
  31. efm (Java, 11 stars)
  32. templeton (Java, 10 stars): New Templeton Repository
  33. docker-protractor (Shell, 8 stars): Ubuntu Docker Image for Protractor
  34. nifi-android-s2s (Java, 7 stars)
  35. hadoop0 (Shell, 7 stars): A Docker sandbox with Hadoop 0.0 (aka Nutch 0.8-dev) and word count example.
  36. HBaseReplicationBridgeServer (Java, 7 stars): HBase Replication Bridge Server
  37. fieldeng-nifi-druid-integration (Java, 7 stars)
  38. docker-cloudbreak-autoscale (Shell, 7 stars): Docker image with Periscope
  39. cloudbreak-docs (JavaScript, 6 stars): Cloudbreak 1.x documentation repo
  40. spark-native-yarn-samples (Scala, 5 stars)
  41. fieldeng-scythe (Scala, 5 stars): Time Series Library
  42. fieldeng-modern-clickstream (Shell, 5 stars)
  43. fluid-bootstrap-theme (HTML, 5 stars): FLUID product design system theme for Bootstrap.
  44. cloudbreak-documentation (CSS, 5 stars): Cloudbreak 2.0 - 2.7.x documentation repo. Cloudbreak 2.8+ docs are stored in the https://github.com/hortonworks/dita-docs repo
  45. cbd-quickstart (Shell, 5 stars)
  46. nifi-ios-s2s (Objective-C, 5 stars): Repository for an iOS client library for Apache NiFi
  47. bman (Python, 4 stars): Bman - An Apache Hadoop cluster manager
  48. salt-bootstrap (Go, 4 stars): Tool for bootstrapping VMs launched by Cloudbreak
  49. docker-haveged (Makefile, 4 stars): Haveged container to increase entropy
  50. pso-hdp-local-repo (Shell, 4 stars): Scripts used to create a Local Repo for installations.
  51. fieldeng-pyscythe (Python, 2 stars): Python Time Series Library
  52. minifi-java (Java, 2 stars)
  53. fieldeng-device-manager-demo (Java, 2 stars)
  54. docker-cloudbreak-shell (Shell, 2 stars)
  55. HA-Monitor (Java, 2 stars)
  56. fieldeng-cronus (Java, 2 stars): Industrial IoT NiFi Layer
  57. isa-l-release (1 star): Public isa_l release repository
  58. cloudbreak-service-registration (Go, 1 star)
  59. fieldeng-ad-server (JavaScript, 1 star)
  60. fieldeng-biologics-manufacturing-demo (Shell, 1 star)
  61. docker-cloudbreak-autoscale-db (Shell, 1 star)
  62. dp-cli-common (Go, 1 star)
  63. docker-cloudbreak-server-db (Shell, 1 star)
  64. fieldeng-nifi-atlas-lineage-reporter (Java, 1 star)
  65. hue-release (1 star): Public hue release repository
  66. fieldeng-data-simulators (Java, 1 star)
  67. fieldeng-retail-store-monitor-demo (Java, 1 star)
  68. fieldeng-nifi-livy-integration (Java, 1 star)
  69. DGC-aetna (JavaScript, 1 star)
  70. ccp-chain-parsing (Java, 1 star)
  71. vega-lite-ui (JavaScript, 1 star)
  72. fieldeng-credit-card-transaction-monitor-mobile-app (Java, 1 star)
  73. azure-cbd-quickstart (Shell, 1 star)
  74. docker-mybatis-migrations (Shell, 1 star)
  75. minifi-cpp (C++, 1 star)
  76. fieldeng-rhea (CSS, 1 star): Industrial IoT UI Layer
  77. iop-solr-stack (Python, 1 star): Ambari Solr mpack for helping BI/HDP migration