STREAM, for lots of devices written in many programming models

BabelStream

Measure memory transfer rates to/from global device memory on GPUs. This benchmark is similar in spirit to, and based on, the STREAM benchmark [1] for CPUs.

Unlike other GPU memory bandwidth benchmarks, this does not include the PCIe transfer time.
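As a rough illustration of what "not including the transfer time" means in practice, the sketch below times only a copy loop and converts the result to a bandwidth figure. It is a host-only example under stated assumptions, not BabelStream's own timing code: the function names are hypothetical, and 1 GB is taken as 10^9 bytes.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: bytes moved by a copy kernel c[i] = a[i],
// which reads one element and writes one element per iteration.
std::size_t copy_bytes(std::size_t n, std::size_t elem_size) {
    return 2 * n * elem_size;
}

// Hypothetical helper: bandwidth in GB/s, assuming 1 GB = 1e9 bytes.
double bandwidth_gb_per_s(std::size_t bytes, double seconds) {
    return static_cast<double>(bytes) / seconds / 1.0e9;
}

// Times only the kernel loop. Allocation and initialisation (the stand-in
// here for any host-to-device transfer) happen before the clock starts.
double time_copy_seconds(std::size_t n) {
    assert(n > 0);
    std::vector<double> a(n, 1.0), c(n, 0.0);

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i) {
        c[i] = a[i];
    }
    const auto stop = std::chrono::steady_clock::now();

    // Read the result so the compiler cannot discard the loop.
    volatile double sink = c[n - 1];
    (void)sink;

    return std::chrono::duration<double>(stop - start).count();
}
```

A caller would combine the pieces as `bandwidth_gb_per_s(copy_bytes(n, sizeof(double)), time_copy_seconds(n))`. A single short run like this is noisy, so the figure is only indicative.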

There are multiple implementations of this benchmark in a variety of programming models.

This code was previously called GPU-STREAM.

Programming Models

BabelStream is currently implemented in the following parallel programming models, listed in no particular order:

  • OpenCL
  • CUDA
  • HIP
  • OpenACC
  • OpenMP 3 and 4.5
  • C++ Parallel STL
  • Kokkos
  • RAJA
  • SYCL and SYCL 2020
  • TBB
  • Thrust (via CUDA or HIP)

This project also contains implementations in alternative languages with different build systems:

How is this different to STREAM?

BabelStream implements the four main kernels of the STREAM benchmark (along with a dot product). By using different programming models, it extends the range of platforms on which the code can run beyond CPUs.

The key differences from STREAM are that:

  • the arrays are allocated on the heap
  • the problem size is unknown at compile time
  • a wider range of platforms and programming models is supported

With stack arrays of known size at compile time, the compiler is able to align data and generate optimal code, for example by issuing non-temporal stores or by removing the peel and remainder loops around vectorised code. This information is not typically available in real HPC codes today, where the problem size is read from the user at runtime.

BabelStream therefore provides a measure of the memory bandwidth that can be attained (by a particular programming model) if you follow today's best practice in parallel programming.
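The kernels and the heap-allocation point above can be sketched in plain C++ as follows. This is a minimal serial illustration, not code from the BabelStream sources: the `StreamArrays` type and the `*_kernel` function names are hypothetical, and the kernel definitions follow the usual STREAM formulation (copy, scale, add, triad) plus a dot product.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container: std::vector keeps its elements on the heap,
// and n is an ordinary runtime value rather than a compile-time constant.
struct StreamArrays {
    std::vector<double> a, b, c;
    StreamArrays(std::size_t n, double a0, double b0, double c0)
        : a(n, a0), b(n, b0), c(n, c0) {}
};

// Copy: c = a
void copy_kernel(StreamArrays& s) {
    for (std::size_t i = 0; i < s.a.size(); ++i) s.c[i] = s.a[i];
}

// Scale (mul): b = scalar * c
void mul_kernel(StreamArrays& s, double scalar) {
    for (std::size_t i = 0; i < s.b.size(); ++i) s.b[i] = scalar * s.c[i];
}

// Add: c = a + b
void add_kernel(StreamArrays& s) {
    for (std::size_t i = 0; i < s.c.size(); ++i) s.c[i] = s.a[i] + s.b[i];
}

// Triad: a = b + scalar * c
void triad_kernel(StreamArrays& s, double scalar) {
    for (std::size_t i = 0; i < s.a.size(); ++i) s.a[i] = s.b[i] + scalar * s.c[i];
}

// Dot: returns the sum over i of a[i] * b[i]
double dot_kernel(const StreamArrays& s) {
    assert(s.a.size() == s.b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < s.a.size(); ++i) sum += s.a[i] * s.b[i];
    return sum;
}
```

Each programming model in BabelStream expresses loops of this shape in its own way; the serial loops here only show what each kernel computes.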

BabelStream also includes the nstream kernel from the Parallel Research Kernels (PRK) project, available on GitHub. Details about PRK can be found in the following references:

  • Van der Wijngaart, Rob F., and Timothy G. Mattson. The parallel research kernels. IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 2014.

  • R. F. Van der Wijngaart, A. Kayi, J. R. Hammond, G. Jost, T. St. John, S. Sridharan, T. G. Mattson, J. Abercrombie, and J. Nelson. Comparing runtime systems with exascale ambitions using the Parallel Research Kernels. ISC 2016, DOI: 10.1007/978-3-319-41321-1_17.

  • Jeff R. Hammond and Timothy G. Mattson. Evaluating data parallelism in C++ using the Parallel Research Kernels. IWOCL 2019, DOI: 10.1145/3318170.3318192.
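For orientation, the nstream kernel is a triad that accumulates into its output array. The following is a minimal serial sketch under that assumption; the function name and signature are illustrative and are not taken from the BabelStream or PRK sources.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// nstream: a[i] += b[i] + scalar * c[i]
// Unlike triad, the output array a is also read, so each iteration
// performs three loads and one store.
void nstream_kernel(std::vector<double>& a,
                    const std::vector<double>& b,
                    const std::vector<double>& c,
                    double scalar) {
    assert(a.size() == b.size() && b.size() == c.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        a[i] += b[i] + scalar * c[i];
    }
}
```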

Building

The drivers, compiler and software for whichever implementation you would like to build are required.

CMake

The project supports building with CMake >= 3.13.0, which can be installed without root via the official script.

Each BabelStream implementation (programming model) is built as follows:

$ cd babelstream

# configure the build, build type defaults to Release
# The -DMODEL flag is required
$ cmake -Bbuild -H. -DMODEL=<model> <model specific flags prefixed with -D...>

# compile
$ cmake --build build

# run executables in ./build
$ ./build/<model>-stream

The MODEL option selects one implementation of BabelStream to build. The source for each model's implementation is located in ./src/<model>.

Currently available models are:

omp;ocl;std;std20;hip;cuda;kokkos;sycl;sycl2020;acc;raja;tbb;thrust

Overriding default flags

By default, we have defined a set of optimal flags for known HPC compilers. These are assigned to RELEASE_FLAGS, and you can override them if required.

To find out what flag each model supports or requires, simply configure while only specifying the model. For example:

> cd babelstream
> cmake -Bbuild -H. -DMODEL=ocl 
...
- Common Release flags are `-O3`, set RELEASE_FLAGS to override
-- CXX_EXTRA_FLAGS: 
        Appends to common compile flags. These will be used at link phase at well.
        To use separate flags at link time, set `CXX_EXTRA_LINKER_FLAGS`
-- CXX_EXTRA_LINK_FLAGS: 
        Appends to link flags which appear *before* the objects.
        Do not use this for linking libraries, as the link line is order-dependent
-- CXX_EXTRA_LIBRARIES: 
        Append to link flags which appears *after* the objects.
        Use this for linking extra libraries (e.g `-lmylib`, or simply `mylib`) 
-- CXX_EXTRA_LINKER_FLAGS: 
        Append to linker flags (i.e GCC's `-Wl` or equivalent)
-- Available models:  omp;ocl;std;std20;hip;cuda;kokkos;sycl;acc;raja;tbb
-- Selected model  :  ocl
-- Supported flags:

   CMAKE_CXX_COMPILER (optional, default=c++): Any CXX compiler that is supported by CMake detection
   OpenCL_LIBRARY (optional, default=): Path to OpenCL library, usually called libOpenCL.so
...

Alternatively, refer to the CI script, which test-compiles most of the models, and see which flags are used there.

It is recommended that you delete the build directory when you change any of the build flags.

GNU Make

Support for Make has been removed from 4.0 onwards. However, as the build process only involves a few source files, the required compile commands can be extracted from the CI output.

Results

Sample results can be found in the results subdirectory. Newer results are found in our Performance Portability repository.

Contributing

As of v4.0, the main branch of this repository will hold the latest released version.

The develop branch will contain unreleased features due for the next (major and/or minor) release of BabelStream. Pull Requests should be made against the develop branch.

Citing

Please cite BabelStream via this reference:

Deakin T, Price J, Martineau M, McIntosh-Smith S. Evaluating attainable memory bandwidth of parallel programming models via BabelStream. International Journal of Computational Science and Engineering. Special issue. Vol. 17, No. 3, pp. 247–262. 2018. DOI: 10.1504/IJCSE.2018.095847

Other BabelStream publications

  • Deakin T, Price J, Martineau M, McIntosh-Smith S. GPU-STREAM v2.0: Benchmarking the achievable memory bandwidth of many-core processors across diverse parallel programming models. 2016. Paper presented at P^3MA Workshop at ISC High Performance, Frankfurt, Germany. DOI: 10.1007/978-3-319-46079-6_34

  • Deakin T, McIntosh-Smith S. GPU-STREAM: Benchmarking the achievable memory bandwidth of Graphics Processing Units. 2015. Poster session presented at IEEE/ACM SuperComputing, Austin, United States. You can view the Poster and Extended Abstract.

  • Deakin T, Price J, Martineau M, McIntosh-Smith S. GPU-STREAM: Now in 2D!. 2016. Poster session presented at IEEE/ACM SuperComputing, Salt Lake City, United States. You can view the Poster and Extended Abstract.

  • Raman K, Deakin T, Price J, McIntosh-Smith S. Improving achieved memory bandwidth from C++ codes on Intel Xeon Phi Processor (Knights Landing). IXPUG Spring Meeting, Cambridge, UK, 2017.

  • Deakin T, Price J, McIntosh-Smith S. Portable methods for measuring cache hierarchy performance. 2017. Poster session presented at IEEE/ACM SuperComputing, Denver, United States. You can view the Poster and Extended Abstract.

[1]: McCalpin, John D., 1995: "Memory Bandwidth and Machine Balance in Current High Performance Computers", IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, December 1995.
