  • Stars: 157
  • Rank: 238,482 (Top 5 %)
  • Language: C++
  • License: MIT License
  • Created almost 7 years ago
  • Updated about 1 month ago


Repository Details

ROCm Parallel Primitives

rocPRIM

rocPRIM is a header-only library providing HIP parallel primitives for developing performant GPU-accelerated code on the AMD ROCm platform.

Requirements

  • Git
  • CMake (3.16 or later)
  • AMD ROCm platform (1.8.2 or later)
  • C++14
  • Python 3.6 or higher (HIP on Windows only, required only for install script)
  • Visual Studio 2019 with clang support (HIP on Windows only)
  • Strawberry Perl (HIP on Windows only)

Optional:

  • GTest
    • Required only for tests. Building tests is off by default (see the BUILD_TEST option below).
    • It will be automatically downloaded and built by the CMake script.
  • Google Benchmark
    • Required only for benchmarks. Building benchmarks is off by default.
    • It will be automatically downloaded and built by the CMake script.

Build and Install

Linux

git clone https://github.com/ROCmSoftwarePlatform/rocPRIM.git

# Go to rocPRIM directory, create and go to the build directory.
cd rocPRIM; mkdir build; cd build

# Configure rocPRIM, setup options for your system.
# Build options:
#   ONLY_INSTALL - OFF by default. If this flag is ON, the build ignores the BUILD_* flags.
#   BUILD_TEST - OFF by default.
#   BUILD_EXAMPLE - OFF by default.
#   BUILD_BENCHMARK - OFF by default.
#   BENCHMARK_CONFIG_TUNING - OFF by default. The purpose of this flag is to find the best kernel config parameters.
#     When ON, compilation time can increase significantly.
#   AMDGPU_TARGETS - list of AMD architectures, default: gfx803;gfx900;gfx906;gfx908.
#     You can make compilation faster if you want to test/benchmark only on one architecture,
#     for example, add -DAMDGPU_TARGETS=gfx906 to 'cmake' parameters.
#   AMDGPU_TEST_TARGETS - list of AMD architectures, default: "" (default system device).
#     To detect failures on a per-GFX-IP basis, set it to a list of IPs. This creates
#     separate tests with the IP name embedded into the test name. Building for all architectures but
#     running only the tests of specific ones is possible, e.g.: ctest -R "gfx803|gfx900"
#
# ! IMPORTANT !
# Set C++ compiler to HIP-clang. You can do it by adding 'CXX=<path-to-compiler>'
# before 'cmake' or setting cmake option 'CMAKE_CXX_COMPILER' to path to the compiler.
# Using HIP-clang:
[CXX=hipcc] cmake -DBUILD_BENCHMARK=ON ../.
#
# ! EXPERIMENTAL !
# Alternatively one may build using the experimental (and highly incomplete) HIP-CPU back-end for host-side
# execution using any C++17 conforming compiler (supported by HIP-CPU). AMDGPU_* options are unavailable in this case. 
#   USE_HIP_CPU - OFF by default

# Build
make -j4

# Optionally, run tests if they're enabled.
ctest --output-on-failure

# Install
[sudo] make install

Windows

Initial support for HIP on Windows has been added. To install, use the provided rmake.py Python script:

git clone https://github.com/ROCmSoftwarePlatform/rocPRIM.git
cd rocPRIM

# the -i option will install rocPRIM to C:\hipSDK by default
python rmake.py -i

# the -c option will build all clients including unit tests
python rmake.py -c

Using rocPRIM

Include the <rocprim/rocprim.hpp> header:

#include <rocprim/rocprim.hpp>

The recommended way of including rocPRIM in a CMake project is to use its package configuration files. The rocPRIM package name is rocprim.

# "/opt/rocm" - default install prefix
find_package(rocprim REQUIRED CONFIG PATHS "/opt/rocm/rocprim")

...

# Includes only rocPRIM headers, HIP libraries have
# to be linked manually by user
target_link_libraries(<your_target> roc::rocprim)

# Includes rocPRIM headers and required HIP dependencies
target_link_libraries(<your_target> roc::rocprim_hip)

Running Unit Tests

Unit tests are implemented in terms of Google Test and collections of tests are wrapped to be invoked from CTest for convenience.

# Go to rocPRIM build directory
cd rocPRIM; cd build

# List available tests
ctest --show-only

# To run all tests
ctest

# Run specific test(s)
ctest -R <regex>

# To run the Google Test manually
./test/rocprim/test_<unit-test-name>

Using multiple GPUs concurrently for testing

This feature requires CMake 3.16+ to be used for building / testing. (Prior versions of CMake cannot assign ids to tests when running in parallel. Assigning tests to distinct devices could only be done at the cost of extreme complexity.)

The unit tests can make use of the CTest Resource Allocation feature, which distributes tests across multiple GPUs in an intelligent manner. The feature can accelerate testing when multiple GPUs of the same family are in a system, and it can test multiple product families from one invocation without having to resort to the HIP_VISIBLE_DEVICES environment variable. The feature relies on the presence of a resource spec file.

IMPORTANT: using RESOURCE_GROUPS and --resource-spec-file with CMake/CTest versions prior to 3.16 silently omits the feature. No warnings are issued about unknown properties or command-line arguments. Make sure that the cmake/ctest binaries invoked are sufficiently recent.

Auto resource spec generation

There is a utility script in the repo that may be called independently:

# Go to rocPRIM build directory
cd rocPRIM; cd build

# Invoke directly or use CMake script mode via cmake -P
../cmake/GenerateResourceSpec.cmake

# Assuming you have 2 compatible GPUs in the system
ctest --resource-spec-file ./resources.json --parallel 2

Manual

Assuming the user has 2 GPUs from the gfx900 family, and they are the first devices enumerated by the system, one may specify -D AMDGPU_TEST_TARGETS=gfx900 during configuration, stating that only one family will be tested. Leaving this variable empty (the default) results in targeting the default device in the system. To let CTest know there are 2 GPUs that should be targeted, one has to feed it a JSON file via the --resource-spec-file <path_to_file> flag. For example:

{
  "version": {
    "major": 1,
    "minor": 0
  },
  "local": [
    {
      "gfx900": [
        {
          "id": "0"
        },
        {
          "id": "1"
        }
      ]
    }
  ]
}

Invoking CTest as ctest --resource-spec-file <path_to_file> --parallel 2 will allow two tests to run concurrently which will be distributed among the two GPUs.

Using custom seeds for the tests

Go to the rocPRIM/test/rocprim/test_seed.hpp file.

//(1)
static constexpr int random_seeds_count = 10;

//(2)
static constexpr unsigned int seeds [] = {0, 2, 10, 1000};

//(3)
static constexpr size_t seed_size = sizeof(seeds) / sizeof(seeds[0]);

(1) defines a constant that sets how many passes over the tests will be done with runtime-generated seeds. Modify at will.

(2) defines the user-defined (static) seeds. Each element of the array will be used as a seed for all tests. Modify at will. If no static seeds are desired, the array should be left empty.

static constexpr unsigned int seeds [] = {};

(3) computes the number of static seeds from the array. This line should never be modified.

Running Benchmarks

# Go to rocPRIM build directory
cd rocPRIM; cd build

# To run benchmark for warp functions:
# Further options can be found using --help
# [] Fields are optional
./benchmark/benchmark_warp_<function_name> [--size <size>] [--trials <trials>]

# To run benchmark for block functions:
# Further options can be found using --help
# [] Fields are optional
./benchmark/benchmark_block_<function_name> [--size <size>] [--trials <trials>]

# To run benchmark for device functions:
# Further options can be found using --help
# [] Fields are optional
./benchmark/benchmark_device_<function_name> [--size <size>] [--trials <trials>]

Performance configuration

Most device-wide primitives provided by rocPRIM can be tuned for different AMD devices, different types, or different operations using compile-time configuration structures passed to them as a template parameter. The main "knobs" are usually the block size and the number of items processed by a single thread.

rocPRIM has built-in default configurations for each of its primitives. To use the included configurations, define the macro ROCPRIM_TARGET_ARCH to 803 if algorithms should be optimized for the gfx803 GCN version, or to 900 for gfx900.

Documentation

The latest rocPRIM documentation and API description can be found here.

It can also be built using the following commands:

# Go to rocPRIM docs directory
cd rocPRIM; cd docs

# Install Python dependencies
python3 -m pip install -r .sphinx/requirements.txt

# Build the documentation
python3 -m sphinx -T -E -b html -d _build/doctrees -D language=en . _build/html

# Optionally, serve the HTML docs locally
cd _build/html
python3 -m http.server

hipCUB

hipCUB is a thin wrapper library on top of rocPRIM or CUB. It enables developers to port projects that use the CUB library to the HIP layer and to run them on AMD hardware. In the ROCm environment hipCUB uses the rocPRIM library as the backend; on CUDA platforms it uses CUB instead.

Support

Bugs and feature requests can be reported through the issue tracker.

Contributions and License

Contributions of any kind are most welcome! More details are found at CONTRIBUTING and LICENSE.

More Repositories

1

ROCm

AMD ROCm™ Software - GitHub Home
Shell
4,583
star
2

HIP

HIP: C++ Heterogeneous-Compute Interface for Portability
C++
3,398
star
3

MIOpen

AMD's Machine Intelligence Library
Assembly
1,060
star
4

HIPIFY

HIPIFY: Convert CUDA to Portable C++ Code
C++
505
star
5

hcc

HCC is an Open Source, Optimizing C++ Compiler for Heterogeneous Compute currently for the ROCm GPU Computing Platform
C++
425
star
6

rocBLAS

Next generation BLAS implementation for ROCm platform
C++
308
star
7

composable_kernel

Composable Kernel: Performance Portable Programming Model for Machine Learning Tensor Operators
C++
285
star
8

omnitrace

Omnitrace: Application Profiling, Tracing, and Analysis
C++
283
star
9

rccl

ROCm Communication Collectives Library (RCCL)
C++
231
star
10

ROCR-Runtime

ROCm Platform Runtime: ROCr a HPC market enhanced HSA based runtime
C++
217
star
11

Tensile

Stretching GPU performance for GEMMs and tensor contractions.
Python
214
star
12

aomp

AOMP is an open source Clang/LLVM based compiler with added support for the OpenMP® API on Radeon™ GPUs. Use this repository for releases, issues, documentation, packaging, and examples.
Fortran
203
star
13

AMDMIGraphX

AMD's graph optimization engine.
C++
185
star
14

rocFFT

Next generation FFT implementation for ROCm
C++
174
star
15

MIVisionX

MIVisionX toolkit is a set of comprehensive computer vision and machine intelligence libraries, utilities, and applications bundled into a single toolkit. AMD MIVisionX also delivers a highly optimized open-source implementation of the Khronos OpenVX™ and OpenVX™ Extensions.
C++
168
star
16

gpufort

GPUFORT: S2S translation tool for CUDA Fortran and Fortran+X in the spirit of hipify
Fortran
159
star
17

rocm-examples

A collection of examples for the ROCm software stack
C++
154
star
18

omniperf

Advanced Profiling and Analytics for AMD Hardware
Python
132
star
19

rocprofiler

ROC profiler library. Profiling with perf-counters and derived metrics.
C
126
star
20

rocMLIR

C++
120
star
21

rocSPARSE

Next generation SPARSE implementation for ROCm platform
C++
117
star
22

rocm_smi_lib

ROCm SMI LIB
C++
116
star
23

rocRAND

RAND library for HIP programming language
C++
111
star
24

HIP-CPU

An implementation of HIP that works on CPUs, across OSes.
C++
107
star
25

rocThrust

ROCm Thrust - run Thrust dependent software on AMD GPUs
C++
100
star
26

ROCm-Device-Libs

ROCm Device Libraries
C
97
star
27

rocSOLVER

Next generation LAPACK implementation for ROCm platform
C++
91
star
28

rocWMMA

rocWMMA
C++
86
star
29

hipCUB

Reusable software components for ROCm developers
C++
81
star
30

rocALUTION

Next generation library for iterative sparse solvers for ROCm platform
C++
74
star
31

hipfort

Fortran interfaces for ROCm libraries
Fortran
69
star
32

roctracer

ROCm Tracer Callback/Activity Library for Performance tracing AMD GPUs
C++
69
star
33

hipSPARSE

ROCm SPARSE marshalling library
C++
67
star
34

atmi

Asynchronous Task and Memory Interface, or ATMI, is a runtime framework and programming model for heterogeneous CPU-GPU systems. It provides a consistent, declarative API to create task graphs on CPUs and GPUs (integrated and discrete).
C++
66
star
35

ROCmValidationSuite

The ROCm Validation Suite is a system administrator’s and cluster manager's tool for detecting and troubleshooting common problems affecting AMD GPU(s) running in a high-performance computing environment, enabled using the ROCm software stack on a compatible platform.
C++
61
star
36

rocm-cmake

CMake modules used within the ROCm libraries
CMake
59
star
37

hipFFT

hipFFT is a FFT marshalling library.
C++
52
star
38

ROCgdb

This is ROCgdb, the ROCm source-level debugger for Linux, based on GDB, the GNU source-level debugger.
C
50
star
39

amd_matrix_instruction_calculator

A tool for generating information about the matrix multiplication instructions in AMD Radeon™ and AMD Instinct™ accelerators
Python
48
star
40

ROCm-CompilerSupport

The compiler support repository provides various Lightning Compiler related services.
C++
46
star
41

rpp

AMD ROCm Performance Primitives (RPP) library is a comprehensive high-performance computer vision library for AMD processors with HIP/OpenCL/CPU back-ends.
C++
46
star
42

ROCclr

44
star
43

rocm_bandwidth_test

Bandwidth test for ROCm
C++
41
star
44

amdsmi

AMD SMI
C++
39
star
45

HIPCC

HIPCC: HIP compiler driver
C++
39
star
46

aotriton

Ahead of Time (AOT) Triton Math Library
Python
37
star
47

Experimental_ROC

Experimental and Intriguing Tools for ROCm
Shell
35
star
48

rocHPCG

HPCG benchmark based on ROCm platform
C++
35
star
49

ROC_SHMEM

ROC_SHMEM intra-kernel networking runtime for AMD dGPUs on the ROCm platform.
C++
34
star
50

MISA

Machine Intelligence Shader Autogen. AMDGPU ML shader code generator. (previously iGEMMgen)
Python
34
star
51

ROCm.github.io

ROCm Website
32
star
52

TransferBench

TransferBench is a utility capable of benchmarking simultaneous copies between user-specified devices (CPUs/GPUs)
C++
29
star
53

rocm-blogs

Jupyter Notebook
28
star
54

clang-ocl

OpenCL compilation with clang compiler.
CMake
26
star
55

hipSOLVER

ROCm SOLVER marshalling library
C++
24
star
56

ROCm-OpenCL-Driver

ROCm OpenCL Compiler Tool Driver
C++
24
star
57

rdc

RDC
C++
23
star
58

hipRAND

Random number library that generate pseudo-random and quasi-random numbers.
C++
23
star
59

rccl-tests

RCCL Performance Benchmark Tests
Cuda
21
star
60

ROCdbgapi

The AMD Debugger API is a library that provides all the support necessary for a debugger and other tools to perform low level control of the execution and inspection of execution state of AMD's commercially available GPU architectures.
C++
19
star
61

pyrsmi

python package of rocm-smi-lib
Python
18
star
62

hip-python

HIP Python Low-level Bindings
Shell
17
star
63

hip-tests

C++
15
star
64

roc-stdpar

C++
14
star
65

pytorch-micro-benchmarking

Python
14
star
66

hipify_torch

Python
13
star
67

rocmProfileData

C++
13
star
68

rocm-docs-core

ROCm Documentation Python package for ReadTheDocs build standardization
CSS
12
star
69

rocAL

The AMD rocAL is designed to efficiently decode and process images and videos from a variety of storage formats and modify them through a processing graph programmable by the user.
C++
11
star
70

half

C++
9
star
71

rocprofiler-sdk

C++
9
star
72

rocBLAS-Examples

Examples illustrating usage of the rocBLAS library
C++
9
star
73

OSU_Microbenchmarks

ROCm - UCX enabled OSU_Benchmarks
C
8
star
74

MITuna

Python
7
star
75

rtg_tracer

C++
7
star
76

Gromacs

ROCm's implementation of Gromacs
C++
6
star
77

rocm-spack-pkgs

Repository to host spack recipes for ROCm
Python
6
star
78

rbuild

Rocm build tool
Python
6
star
79

rocm-core

CMake
5
star
80

rocm-llvm-python

Low-level Cython and Python bindings to the (ROCm) LLVM and AMD COMGR C API. Also ships the official LLVM Clang bindings.
Shell
4
star
81

hip-testsuite

Python
4
star
82

MIFin

Tuna centric MIOpen client
C++
4
star
83

flang

Mirror of flang repo: The source repo is https://github.com/flang-compiler/flang . Once a day the master branch is updated from the upstream source repo and then locked. AOMP or ROCm developers may commit or create PRs on branch aomp-dev.
C++
3
star
84

numba-hip

HIP backend patch for Numba, the NumPy aware dynamic Python compiler using LLVM.
Python
3
star
85

tensorcast

Python
3
star
86

hipSPARSELt

C++
2
star
87

aomp-extras

hostcall services library, math library, and utilities
Shell
2
star
88

MIOpenExamples

MIOpen examples
C++
2
star
89

rocprofiler-register

CMake
2
star
90

rocm-install-on-windows

2
star
91

hipOMB

OSU MPI benchmarks with ROCm support
C
1
star
92

migraphx-benchmark

1
star
93

rocm-recipes

Recipes for rocm
CMake
1
star
94

hipBLAS-common

Common files shared by hipBLAS and hipBLASLt
CMake
1
star