

mpiBench

Times MPI collectives over a series of message sizes

What is mpiBench?

mpiBench.c

This program measures MPI collective performance for a range of message sizes. The user may specify:

  • the collective to perform,
  • the message size limits,
  • the number of iterations to perform,
  • the maximum memory a process may allocate for MPI buffers,
  • the maximum time permitted for a given test,
  • and the number of Cartesian dimensions to divide processes into.

By default, mpiBench runs all supported collectives on MPI_COMM_WORLD with message sizes from 0 to 256K bytes and a 1G per-process buffer limit. Each test executes as many iterations as it can fit within a default time limit of 50000 usecs.

crunch_mpiBench

This is a Perl script that filters data and generates reports from mpiBench output files. It can merge data from multiple mpiBench output files into a single report, and it can filter output to a subset of collectives. By default, it reports the operation duration (i.e., how long the collective took to complete). For some collectives, it can also report effective bandwidth. If given two datasets, it computes a speedup factor.

What is measured

mpiBench measures the total time required to iterate through a loop of back-to-back invocations of the same collective (optionally separated by a barrier) and divides by the number of iterations. In other words, the timing kernel looks like the following:

time_start = timer();
for (i = 0; i < iterations; i++) {
  collective(msg_size);
  barrier();
}
time_end = timer();
time = (time_end - time_start) / iterations;

Each participating MPI process performs this measurement and all report their times. The average, minimum, and maximum across this set of times are reported.
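
As a concrete illustration of this measure-and-report pattern, the following is a minimal, self-contained C sketch that times MPI_Barrier over a fixed iteration count and reduces the per-rank results to an average, minimum, and maximum. It is illustrative only and not taken from mpiBench.c.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  int rank, nranks, i, iterations = 1000;
  double start, end, local, tmin, tmax, tsum;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  /* prime the collective once before timing */
  MPI_Barrier(MPI_COMM_WORLD);

  /* timing kernel: back-to-back invocations of the same collective */
  start = MPI_Wtime();
  for (i = 0; i < iterations; i++) {
    MPI_Barrier(MPI_COMM_WORLD);
  }
  end = MPI_Wtime();
  local = (end - start) / iterations;   /* seconds per invocation on this rank */

  /* every rank reports; reduce to min, max, and average across ranks */
  MPI_Reduce(&local, &tmin, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
  MPI_Reduce(&local, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
  MPI_Reduce(&local, &tsum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
  if (rank == 0) {
    printf("avg %g  min %g  max %g  (seconds per call)\n",
           tsum / nranks, tmin, tmax);
  }

  MPI_Finalize();
  return 0;
}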

Before the timing kernel is started, the collective is invoked once to prime it, since the initial call may be subject to overhead that later calls are not. Then, the collective is timed across a small set of iterations (~5) to get a rough estimate of the time required for a single invocation. If the user specifies a time limit using the -t option, this estimate is used to reduce the number of iterations made in the timing kernel loop, as necessary, so it may execute within the time limit.
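
The iteration-trimming logic can be sketched roughly as follows, assuming a five-call trial loop and MPI_Barrier as a stand-in collective; the function and variable names are illustrative and not taken from mpiBench.c.

#include <mpi.h>

/* Illustrative only: estimate how many iterations fit within a time limit,
 * based on a short trial of the collective (here MPI_Barrier). */
int estimate_iterations(MPI_Comm comm, int max_iterations, double limit_usecs)
{
  const int trial = 5;
  double start, per_call_usecs;
  int i, iterations;

  start = MPI_Wtime();
  for (i = 0; i < trial; i++) {
    MPI_Barrier(comm);
  }
  per_call_usecs = (MPI_Wtime() - start) / trial * 1.0e6;

  if (limit_usecs <= 0.0 || per_call_usecs <= 0.0) {
    return max_iterations;              /* 0 means no time limit */
  }
  iterations = (int)(limit_usecs / per_call_usecs);
  if (iterations < 1)               iterations = 1;
  if (iterations > max_iterations)  iterations = max_iterations;
  return iterations;
}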

Basic Usage

Build:

make

Run:

srun -n <procs> ./mpiBench > output.txt

Analyze:

crunch_mpiBench output.txt

Build Instructions

There are several make targets available:

  • make -- simple build
  • make nobar -- build without barriers between consecutive collective invocations
  • make debug -- build with "-g -O0" for debugging purposes
  • make clean -- clean the build

If you'd like to build manually without the makefiles, there are some compile-time options that you should be aware of:

-D NO_BARRIER - drop the barrier between consecutive collective invocations
-D USE_GETTIMEOFDAY - use gettimeofday() instead of MPI_Wtime() for timing info
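
A rough sketch of where these two switches could take effect in the timing code is shown below; the function names are illustrative and not taken from mpiBench.c.

#include <mpi.h>
#ifdef USE_GETTIMEOFDAY
#include <sys/time.h>
#endif

/* Illustrative sketch of the compile-time switches, not the actual source. */
static double timer(void)
{
#ifdef USE_GETTIMEOFDAY
  struct timeval tv;
  gettimeofday(&tv, NULL);
  return tv.tv_sec + tv.tv_usec * 1.0e-6;   /* seconds via gettimeofday() */
#else
  return MPI_Wtime();                       /* seconds via MPI_Wtime() */
#endif
}

static double time_collective(MPI_Comm comm, int iterations)
{
  int i, buf = 0;
  double start = timer();
  for (i = 0; i < iterations; i++) {
    MPI_Bcast(&buf, 1, MPI_INT, 0, comm);   /* stand-in for the collective under test */
#ifndef NO_BARRIER
    MPI_Barrier(comm);                      /* dropped when built with -D NO_BARRIER */
#endif
  }
  return (timer() - start) / iterations;
}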

Usage Syntax

Usage:  mpiBench [options] [operations]

Options:
  -b <byte>  Beginning message size in bytes (default 0)
  -e <byte>  Ending message size in bytes (default 1K)
  -i <itrs>  Maximum number of iterations for a single test
             (default 1000)
  -m <byte>  Process memory buffer limit (send+recv) in bytes
             (default 1G)
  -t <usec>  Time limit for any single test in microseconds
             (default 0 = infinity)
  -d <ndim>  Number of dimensions to split processes in
             (default 0 = MPI_COMM_WORLD only)
  -c         Check receive buffer for expected data in last
             iteration (default disabled)
  -C         Check receive buffer for expected data every
             iteration (default disabled)
  -h         Print this help screen and exit
  where <byte> = [0-9]+[KMG], e.g., 32K or 64M

Operations:
  Barrier
  Bcast
  Alltoall, Alltoallv
  Allgather, Allgatherv
  Gather, Gatherv
  Scatter
  Allreduce
  Reduce
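
For the -d option, dividing processes into Cartesian dimensions can be pictured with standard MPI calls as in the sketch below, which splits MPI_COMM_WORLD into a 2-D grid and extracts a sub-communicator along each dimension. It is illustrative only and may differ from how mpiBench performs the split.

#include <mpi.h>

/* Illustrative 2-D Cartesian split of MPI_COMM_WORLD; not taken from
 * mpiBench.c. Each remaining dimension yields a sub-communicator on
 * which a collective could be timed. */
static void split_cartesian_2d(MPI_Comm *dim0_comm, MPI_Comm *dim1_comm)
{
  int nranks, dims[2] = {0, 0}, periods[2] = {0, 0}, remain[2];
  MPI_Comm cart;

  MPI_Comm_size(MPI_COMM_WORLD, &nranks);
  MPI_Dims_create(nranks, 2, dims);                /* e.g., 12 ranks -> 4 x 3 */
  MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart);

  remain[0] = 1; remain[1] = 0;                    /* sub-communicators along dimension 0 */
  MPI_Cart_sub(cart, remain, dim0_comm);

  remain[0] = 0; remain[1] = 1;                    /* sub-communicators along dimension 1 */
  MPI_Cart_sub(cart, remain, dim1_comm);

  MPI_Comm_free(&cart);
}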

Examples

mpiBench

Run the default set of tests:

srun -n2 -ppdebug mpiBench

Run the default message size range and iteration count for Alltoall, Allreduce, and Barrier:

srun -n2 -ppdebug mpiBench Alltoall Allreduce Barrier

Run from 32-256 bytes and time across 100 iterations of Alltoall:

srun -n2 -ppdebug mpiBench -b 32 -e 256 -i 100 Alltoall

Run from 0-2K bytes and default iteration count for Gather, but reduce the iteration count, as necessary, so each message size test finishes within 100,000 usecs:

srun -n2 -ppdebug mpiBench -e 2K -t 100000 Gather

crunch_mpiBench

Show data for just Alltoall:

crunch_mpiBench -op Alltoall out.txt

Merge data from several files into a single report:

crunch_mpiBench out1.txt out2.txt out3.txt

Display effective bandwidth for Allgather and Alltoall:

crunch_mpiBench -bw -op Allgather,Alltoall out.txt

Compare times in output files in dir1 with those in dir2:

crunch_mpiBench -data DIR1_DATA dir1/* -data DIR2_DATA dir2/*

Additional Notes

Rank 0 always acts as the root process for collectives which involve a root.

If the minimum and maximum are quite different, then some processes may be escaping ahead to start later iterations before the last one has completely finished. In this case, one may use the maximum time reported or insert a barrier between consecutive invocations (build with "make" instead of "make nobar") to synchronize the processes.

For Reduce and Allreduce, vectors of doubles are added, so message sizes of 1, 2, and 4 bytes are skipped (a single double occupies 8 bytes, so these sizes cannot hold even one element).
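
As a sketch of why these sizes are skipped, the element count passed to the MPI call is the message size divided by sizeof(double); the helper below is illustrative only and not mpiBench's code.

#include <mpi.h>
#include <stddef.h>

/* Illustrative only: Reduce/Allreduce operate on vectors of doubles, so a
 * message smaller than sizeof(double) cannot hold a single element. */
static void allreduce_once(double *sendbuf, double *recvbuf,
                           size_t msg_size, MPI_Comm comm)
{
  int count = (int)(msg_size / sizeof(double));
  MPI_Allreduce(sendbuf, recvbuf, count, MPI_DOUBLE, MPI_SUM, comm);
}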

Two available make commands build mpiBench with test kernels like the following:

   "make"              "make nobar"
start=timer()        start=timer()
for(i=o;i<N;i++)     for(i=o;i<N;i++)
{                    {
  MPI_Gather()         MPI_Gather()
  MPI_Barrier()
}                    }
end=timer()          end=timer()
time=(end-start)/N   time=(end-start)/N

"make nobar" may allow processes to escape ahead, but does not include cost of barrier.
