• Stars: 103
• Rank: 333,046 (top 7%)
• Language: Python
• License: Apache License 2.0
• Created: over 4 years ago
• Updated: 9 months ago

Repository Details

RAPIDS GPU-BDB

Disclaimer

gpu-bdb is derived from TPCx-BB. Any results based on gpu-bdb are considered unofficial results, and per TPC policy, cannot be compared against official TPCx-BB results.

Overview

The GPU Big Data Benchmark (gpu-bdb) is a benchmark for enterprises, built on RAPIDS libraries, that includes 30 queries representing real-world ETL and ML workflows at various "scale factors": SF1000 is 1 TB of data, SF10000 is 10 TB. Each "query" is in fact a model workflow that can include SQL, user-defined functions, careful sub-setting and aggregation, and machine learning.

Conda Environment Setup

We provide a conda environment definition specifying all RAPIDS dependencies needed to run our query implementations. To install and activate it:

CONDA_ENV="rapids-gpu-bdb"
conda env create --name $CONDA_ENV -f gpu-bdb/conda/rapids-gpu-bdb.yml
conda activate $CONDA_ENV

Installing RAPIDS bdb Tools

This repository includes a small local module containing utility functions for running the queries. You can install it with the following:

cd gpu-bdb/gpu_bdb
python -m pip install .

This will install a package named bdb-tools into your Conda environment. It should look like this:

conda list | grep bdb
bdb-tools                 0.2                      pypi_0    pypi

Note that this Conda environment needs to be replicated or installed manually on all nodes, so that one dask-cuda-worker can be started per node.

NLP Query Setup

Queries 10, 18, and 19 depend on two static files (negativeSentiment.txt and positiveSentiment.txt). As we cannot redistribute those files, you should download the tpcx-bb toolkit and extract them to your data directory on your shared filesystem:

jar xf bigbenchqueriesmr.jar
cp tpcx-bb1.3.1/distributions/Resources/io/bigdatabenchmark/v1/queries/q10/*.txt ${DATA_DIR}/sentiment_files/

Query 27 relies on spaCy. To download the necessary language model after activating the Conda environment:

python -m spacy download en_core_web_sm

Starting Your Cluster

We use the dask-scheduler and dask-cuda-worker command line interfaces to start a Dask cluster. We provide a cluster_configuration directory with a bash script to help you set up an NVLink-enabled cluster using UCX.

Before running the script, you'll need to make a few changes specific to your environment (an illustrative example follows the list below).

In cluster_configuration/cluster-startup.sh:

- Update `GPU_BDB_HOME=...` to the on-disk location of this repository.
- Update `CONDA_ENV_PATH=...` to refer to your conda environment path.
- Update `CONDA_ENV_NAME=...` to refer to the name of the conda environment you created, perhaps using the `yml` files provided in this repository.
- Update `INTERFACE=...` to refer to the relevant network interface present on your cluster.
- Update `CLUSTER_MODE="TCP"` to refer to your communication method, either "TCP" or "NVLINK". You can also configure this as an environment variable.
- You may also need to change the `LOCAL_DIRECTORY` and `WORKER_DIR` depending on your filesystem. Make sure that these point to a location to which you have write access and that `LOCAL_DIRECTORY` is accessible from all nodes.
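
For reference, a fully edited variable block might look like the following sketch. All values here are illustrative assumptions; substitute the paths, environment name, and network interface that match your own machines.

# Illustrative values only; adjust for your environment
GPU_BDB_HOME=/home/user/gpu-bdb          # where you cloned this repo (assumption)
CONDA_ENV_PATH=/home/user/miniconda3     # path to your conda installation (assumption)
CONDA_ENV_NAME=rapids-gpu-bdb            # name used when creating the environment
INTERFACE=ib0                            # network interface present on every node (assumption)
CLUSTER_MODE="NVLINK"                    # or "TCP"
LOCAL_DIRECTORY=/shared/dask-scratch     # must be writable and visible from all nodes (assumption)
WORKER_DIR=/raid/dask-workers            # per-node worker scratch space (assumption)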

To start up the cluster on your scheduler node, please run the following from gpu_bdb/cluster_configuration/. This will spin up a scheduler and one Dask worker per GPU.

DASK_JIT_UNSPILL=True CLUSTER_MODE=NVLINK bash cluster-startup.sh SCHEDULER

Note: Don't use DASK_JIT_UNSPILL when running BlazingSQL queries.

Then run the following on every other node from gpu_bdb/cluster_configuration/.

bash cluster-startup.sh

This will spin up one Dask worker per GPU. If you are running on a single node, you will only need to run bash cluster-startup.sh SCHEDULER.

If you are using a Slurm cluster, please adapt the example Slurm setup in gpu_bdb/benchmark_runner/slurm/ which uses gpu_bdb/cluster_configuration/cluster-startup-slurm.sh.

Running the Queries

To run a query, starting from the repository root, go to the query-specific subdirectory. For example, to run q07:

cd gpu_bdb/queries/q07/

The queries assume that they can attach to a running Dask cluster. The cluster address and other benchmark configuration live in a YAML file (gpu_bdb/benchmark_runner/benchmark_config.yaml). You will need to fill this out as appropriate if you are not using the Slurm cluster configuration.

conda activate rapids-gpu-bdb
python gpu_bdb_query_07.py --config_file=../../benchmark_runner/benchmark_config.yaml
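
Conceptually, each query script uses the address from that config file to connect a Dask client before doing any work. A minimal sketch of the equivalent manual connection (the scheduler address below is an illustrative assumption, not a value from this repository):

# Minimal sketch: connect a Dask client to an already-running scheduler.
# The real query scripts read this address from benchmark_config.yaml.
from dask.distributed import Client

client = Client("tcp://scheduler-node:8786")  # illustrative address
print(len(client.scheduler_info()["workers"]), "workers connected")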

To profile a gpu-bdb query with NVIDIA Nsight Systems (nsys), change start_local_cluster in benchmark_config.yaml to True and run:

nsys profile -t cuda,nvtx python gpu_bdb_query_07_dask_sql.py --config_file=../../benchmark_runner/benchmark_config.yaml

Note: There is no need to start workers with cluster-startup.sh, since a LocalCUDACluster is started inside the attach_to_cluster API.
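
For intuition, what happens in this mode is roughly equivalent to the following minimal sketch (a simplified assumption, not the actual implementation in attach_to_cluster):

# Minimal sketch: a local, single-node cluster with one worker per GPU,
# as started internally when start_local_cluster is True.
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster()  # spawns one CUDA worker per visible GPU
client = Client(cluster)      # queries then run against this client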

Performance Tracking

This repository includes optional performance-tracking automation using Google Sheets. To enable logging of query runtimes, set the following on the client node:

export GOOGLE_SHEETS_CREDENTIALS_PATH=<path to creds.json>

Then configure the --sheet and --tab arguments in benchmark_config.yaml.

Running all of the Queries

The included benchmark_runner.py script will run all queries sequentially. Configuration for this type of end-to-end run is specified in benchmark_runner/benchmark_config.yaml.

To run all queries, cd to gpu_bdb/ and:

python benchmark_runner.py --config_file benchmark_runner/benchmark_config.yaml

By default, this will run each Dask query five times and, if BlazingSQL queries are enabled in benchmark_config.yaml, each BlazingSQL query five times as well. You can control the number of repeats by changing the N_REPEATS variable in the script.

BlazingSQL

BlazingSQL implementations of all queries are included. BlazingSQL currently supports communication via TCP. To run BlazingSQL queries, please follow the instructions above to create a cluster using CLUSTER_MODE=TCP.

Data Generation

The RAPIDS queries expect Apache Parquet formatted data. We provide a script which can be used to convert bigBench dataGen's raw CSV files to optimally sized Parquet partitions.
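
The core of such a conversion can be expressed in a few lines of dask_cudf. The sketch below is illustrative, not the provided script; the paths, delimiter, and partition size are assumptions you would adapt to your data:

# Minimal sketch: convert one raw CSV table to Parquet with dask_cudf.
import dask_cudf

df = dask_cudf.read_csv("raw/customer/*.csv", sep="|")  # delimiter is an assumption
df = df.repartition(partition_size="2GB")               # target roughly uniform partition sizes
df.to_parquet("parquet/customer/")                      # write the Parquet partitions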

More Repositories

1. cudf: cuDF - GPU DataFrame Library (C++, 8,319 stars)
2. cuml: cuML - RAPIDS Machine Learning Library (C++, 3,864 stars)
3. cugraph: cuGraph - RAPIDS Graph Analytics Library (Cuda, 1,668 stars)
4. cusignal: cuSignal - RAPIDS Signal Processing Library (Python, 703 stars)
5. raft: RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing high performance applications. (Cuda, 586 stars)
6. jupyterlab-nvdashboard: A JupyterLab extension for displaying dashboards of GPU usage. (TypeScript, 582 stars)
7. notebooks: RAPIDS Sample Notebooks (Shell, 577 stars)
8. cuspatial: CUDA-accelerated GIS and spatiotemporal algorithms (Jupyter Notebook, 543 stars)
9. rmm: RAPIDS Memory Manager (C++, 420 stars)
10. deeplearning (Jupyter Notebook, 336 stars)
11. cucim: cuCIM - RAPIDS GPU-accelerated image processing library (Jupyter Notebook, 333 stars)
12. dask-cuda: Utilities for Dask and CUDA interactions (Python, 266 stars)
13. cuxfilter: GPU accelerated cross filtering with cuDF. (Python, 261 stars)
14. node: GPU-accelerated data science and visualization in node (TypeScript, 170 stars)
15. clx: A collection of RAPIDS examples for security analysts, data scientists, and engineers to quickly get started applying RAPIDS and GPU acceleration to real-world cybersecurity use cases. (Jupyter Notebook, 167 stars)
16. libgdf: [ARCHIVED] C GPU DataFrame Library (Cuda, 138 stars)
17. dask-cudf: [ARCHIVED] Dask support for distributed GDF object --> Moved to cudf (Python, 135 stars)
18. cloud-ml-examples: A collection of Machine Learning examples to get started with deploying RAPIDS in the Cloud (Jupyter Notebook, 134 stars)
19. ucx-py: Python bindings for UCX (Python, 118 stars)
20. kvikio: KvikIO - High Performance File IO (Python, 100 stars)
21. plotly-dash-rapids-census-demo (Jupyter Notebook, 92 stars)
22. gputreeshap (C++, 83 stars)
23. frigate: Frigate is a tool for automatically generating documentation for your Helm charts (Python, 76 stars)
24. wholegraph: WholeGraph - large scale Graph Neural Networks (Cuda, 75 stars)
25. spark-examples: [ARCHIVED] Moved to github.com/NVIDIA/spark-xgboost-examples (Jupyter Notebook, 70 stars)
26. docker: Dockerfile templates for creating RAPIDS Docker Images (Shell, 69 stars)
27. cuvs: cuVS - a library for vector search and clustering on the GPU (Jupyter Notebook, 57 stars)
28. custrings: [ARCHIVED] GPU String Manipulation --> Moved to cudf (Cuda, 46 stars)
29. docs: RAPIDS Documentation Site (HTML, 34 stars)
30. cudf-alpha: [ARCHIVED] cuDF [alpha] - RAPIDS Merge of GoAi into cuDF (34 stars)
31. rapids-examples (Jupyter Notebook, 31 stars)
32. nvgraph (C++, 26 stars)
33. rapids-cmake (CMake, 24 stars)
34. cuhornet (Cuda, 24 stars)
35. cuDataShader (Jupyter Notebook, 22 stars)
36. gpuci-build-environment: Common build environment used by gpuCI for building RAPIDS (Dockerfile, 19 stars)
37. distributed-join (C++, 19 stars)
38. devcontainers (Shell, 18 stars)
39. dask-cuml: [ARCHIVED] Dask support for multi-GPU machine learning algorithms --> Moved to cuml (Python, 16 stars)
40. integration: RAPIDS - combined conda package & integration tests for all of RAPIDS libraries (Shell, 15 stars)
41. xgboost-conda: Conda recipes for xgboost (Jupyter Notebook, 12 stars)
42. benchmark (Python, 11 stars)
43. ucxx (C++, 11 stars)
44. dependency-file-generator (Python, 10 stars)
45. asvdb (Python, 9 stars)
46. helm-chart (Shell, 9 stars)
47. deployment: RAPIDS Deployment Documentation (Jupyter Notebook, 9 stars)
48. miniforge-cuda (Dockerfile, 9 stars)
49. ci-imgs (Dockerfile, 7 stars)
50. dask-cugraph (Python, 7 stars)
51. rapids.ai: rapids.ai web site (HTML, 7 stars)
52. ptxcompiler (Python, 6 stars)
53. GaaS (Python, 5 stars)
54. rvc (Go, 4 stars)
55. scikit-learn-nv (Python, 4 stars)
56. ops-bot: A Probot application used by the Ops team for automation. (TypeScript, 4 stars)
57. workflows (Shell, 4 stars)
58. rapids-triton (C++, 4 stars)
59. dask-build-environment: Build environments for various dask related projects on gpuCI (Dockerfile, 3 stars)
60. roc: GitHub utilities for the RAPIDS Ops team (Go, 3 stars)
61. multi-gpu-tools (Shell, 3 stars)
62. detect-weak-linking (Python, 3 stars)
63. dask-cuda-benchmarks (Python, 2 stars)
64. shared-workflows: Reusable GitHub Actions workflows for RAPIDS CI (Shell, 2 stars)
65. rapids_triton_pca_example (C++, 2 stars)
66. cugunrock (Cuda, 2 stars)
67. dgl-cugraph-build-environment (Dockerfile, 2 stars)
68. projects (Jupyter Notebook, 2 stars)
69. crossfit: Metric calculation library (Python, 2 stars)
70. gpuci-mgmt: Management scripts for gpuCI (Shell, 1 star)
71. ansible-roles (1 star)
72. code-share (C++, 1 star)
73. build-metrics-reporter (Python, 1 star)
74. cibuildwheel-imgs (Dockerfile, 1 star)
75. gpuci-tools: User tools for use within the gpuCI environment (Shell, 1 star)
76. pynvjitlink (Python, 1 star)
77. rapids-dask-dependency (Shell, 1 star)
78. sphinx-theme: This repository contains a Sphinx theme used for RAPIDS documentation (CSS, 1 star)