flink-sql-benchmark

Step 1: Environment preparation

  • Recommended configuration for Hadoop cluster

    • Resource allocation
      • master *1 :
        • vCPU 32 cores, Memory: 128 GiB / System disk: 120GB *1, Data disk: 80GB *1
      • worker *15 :
        • vCPU 80 cores, Memory: 352 GiB / System disk: 120GB *1, Data disk: 7300GB *30
      • This document was tested with Hadoop 3.2.1 and Hive 3.1.2
    • Hadoop environment preparation
      • Install Hadoop, and then configure HDFS and Yarn according to Configuration Document
      • Set Hadoop environment variables: ${HADOOP_HOME} and ${HADOOP_CLASSPATH}
      • Configure Yarn, and then start ResourceManager and NodeManager
        • Modify yarn.application.classpath in yarn-site.xml, adding MapReduce's dependencies to the classpath (see the sketch at the end of this section):
          • $HADOOP_YARN_HOME/share/hadoop/yarn/lib/*,$HADOOP_YARN_HOME/share/hadoop/mapreduce/*
    • Hive environment preparation
      • Run hive metastore and HiveServer2
        • Set the environment variable ${HIVE_CONF_DIR}: it should point to the directory containing hive-site.xml.
        • Execute: nohup hive --service metastore & and nohup hive --service hiveserver2 &
    • Others
      • gcc is needed in your master node to build the TPC-DS data generator
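
As referenced above, a minimal sketch of the yarn-site.xml change for yarn.application.classpath; "existing entries" stands for whatever values your cluster already lists there:

    <!-- yarn-site.xml: append the MapReduce jars to the container classpath -->
    <property>
      <name>yarn.application.classpath</name>
      <value>existing entries,$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*,$HADOOP_YARN_HOME/share/hadoop/mapreduce/*</value>
    </property>
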
  • TPC-DS benchmark project preparation

    • Download the flink-sql-benchmark project to your master node via git clone https://github.com/ververica/flink-sql-benchmark.git
      • In this document, the absolute path of the benchmark project is referred to as ${INSTALL_PATH}
    • Generate the execution jar: cd ${INSTALL_PATH} and run mvn clean package -DskipTests (please make sure Maven is installed)
  • Flink environment preparation

    • Download Flink-1.16 to the folder ${INSTALL_PATH}/packages
      • Flink-1.16 download path
      • Replace flink-conf.yaml:
        • mv ${INSTALL_PATH}/tools/flink/flink-conf.yaml ${INSTALL_PATH}/packages/flink-1.16/conf/flink-conf.yaml
      • Download flink-sql-connector-hive-3.1.2_2.12 and place it in ${INSTALL_PATH}/packages/flink-1.16/lib
      • Start the yarn session cluster (a quick check is sketched below):
        • cd ${INSTALL_PATH}/packages/flink-1.16/bin and run ./yarn-session.sh -d -qu default
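
To verify the session cluster is up, list the running Yarn applications; by default the Flink session registers under the name "Flink session cluster" (it differs if you passed -nm):

    # expect one RUNNING application for the Flink session
    yarn application -list
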

Step 2: Generate TPC-DS dataset

Please cd ${INSTALL_PATH} first.

  • Set common environment variables
    • vim tools/common/env.sh (a sample is sketched below):
      • Non-partitioned tables
        • SCALE is the size of the generated dataset in GB. The recommended value is 10000 (10TB)
        • FLINK_TEST_DB is the Hive database name, which will be used by Flink
          • It is recommended to keep the default name: export FLINK_TEST_DB=tpcds_bin_orc_$SCALE
      • Partitioned tables
        • FLINK_TEST_DB needs to be changed to export FLINK_TEST_DB=tpcds_bin_partitioned_orc_$SCALE
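
A minimal sketch of the relevant lines in tools/common/env.sh, using only the variables described above; anything else already in the file stays unchanged:

    # size of the generated dataset in GB (10000 = 10TB)
    export SCALE=10000
    # Hive database used by Flink (non-partitioned ORC tables)
    export FLINK_TEST_DB=tpcds_bin_orc_$SCALE
    # for partitioned tables, use this instead:
    # export FLINK_TEST_DB=tpcds_bin_partitioned_orc_$SCALE
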
  • Data generation
    • Non-partitioned tables
      • Run ./tools/datagen/init_db_for_none_partition_tables.sh. The script first launches a MapReduce job that generates the data in text format and stores it in HDFS; it then creates external Hive databases based on the generated text files.
        • The original text-format data is stored in the HDFS folder /tmp/tpcds-generate/${SCALE}
        • The Hive external database that points to the original text data is tpcds_text_${SCALE}
        • The Hive external database with ORC-format tables, built from the text data, is tpcds_bin_orc_${SCALE}
    • Partitioned tables
      • Run ./tools/datagen/init_db_for_partition_tables.sh. If the original text-format data does not yet exist in HDFS, the script generates it first; it then creates external Hive databases with partitioned tables based on the generated text files.
        • The original text-format data is likewise stored in the HDFS folder /tmp/tpcds-generate/${SCALE}
        • The Hive external database that points to the original text data is tpcds_text_${SCALE}
        • The Hive external database with partitioned ORC-format tables, built from the text data, is tpcds_bin_partitioned_orc_${SCALE}
          • This command can take a very long time because Hive dynamic-partition writes are slow
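
A quick sanity check after generation, using the HDFS path and database names listed above:

    # raw text data produced by the MapReduce job
    hadoop fs -ls /tmp/tpcds-generate/${SCALE}
    # external Hive databases created by the init scripts
    hive -e "SHOW DATABASES LIKE 'tpcds*';"
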

Step 3: Generate table statistics for TPC-DS dataset

Please cd ${INSTALL_PATH} first.

  • Generate statistics for each table and every column:
    • Run ./tools/stats/analyze_table_stats.sh
      • Note: this step uses Flink's ANALYZE TABLE syntax to generate statistics, so it is only supported since Flink-1.16
      • For partitioned tables this step is very slow. It is recommended to use Hive's ANALYZE TABLE syntax instead, which generates the same stats as Flink's ANALYZE TABLE syntax (see the sketch below)
        • The document for Hive analyze table syntax
        • In Hive 3.x, you can run ANALYZE TABLE partition_table_name COMPUTE STATISTICS FOR COLUMNS; in the Hive client to generate stats for all partitions, instead of specifying one partition at a time
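
For reference, the two equivalent statements side by side; table_name and partition_table_name are placeholders:

    -- Flink SQL (1.16+)
    ANALYZE TABLE table_name COMPUTE STATISTICS FOR ALL COLUMNS;
    -- Hive 3.x client (covers all partitions of a partitioned table)
    ANALYZE TABLE partition_table_name COMPUTE STATISTICS FOR COLUMNS;
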

Step 4: Run TPC-DS queries

Please cd ${INSTALL_PATH} first.

  • Run TPC-DS queries:
    • Run a specific query: ./tools/flink/run_query.sh 1 q1.sql
      • 1 is the number of execution iterations (the default is 1); q1.sql is the query to run (q1 -> q99).
      • If the number of iterations is 1, the total execution time will be longer than the pure query execution time because of the Flink cluster warm-up cost
    • Run all queries: ./tools/flink/run_query.sh 2
      • Because running all queries takes a long time, it is recommended to run the command in nohup mode: nohup ./tools/flink/run_query.sh 2 > partition_result.log 2>&1 &
  • Modify tools/flink/run_query.sh to add other optional factors (see the sketch after this list):
    • --location: the SQL query path. If you only want to execute some of the queries, place those SQL files in a folder and use this parameter to point to it.
    • --mode: execute or explain; the default is execute
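
As referenced above, a sketch of a run once the optional factors are wired in; how run_query.sh accepts them depends on your edit of the script, and ${INSTALL_PATH}/my_queries is a hypothetical folder:

    # explain (rather than execute) only the queries placed in a custom folder
    ./tools/flink/run_query.sh 1 --location ${INSTALL_PATH}/my_queries --mode explain
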
  • Kill yarn jobs:
    • Run yarn application -list to get the yarn application id of the current Flink job
    • Run yarn application -kill application_xxx to kill this yarn job
    • If you stop the current Flink job in this way, you need to restart the yarn session cluster the next time you run TPC-DS queries:
      • cd ${INSTALL_PATH}/packages/flink-1.16/bin and run ./yarn-session.sh -d -qu default
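
The full stop-and-restart cycle, using only the commands above (application_xxx is a placeholder for the id from the listing):

    yarn application -list
    yarn application -kill application_xxx
    # later, before running queries again:
    cd ${INSTALL_PATH}/packages/flink-1.16/bin
    ./yarn-session.sh -d -qu default
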
