  • Language
    C++
  • License
    MIT License
  • Created almost 2 years ago
  • Updated about 2 months ago

Repository Details

ROCm Examples

A collection of examples to enable new users to start using ROCm. Advanced users may learn about new functionality through our advanced examples.

Repository Contents

  • AI Showcases the functionality for executing quantized models using Torch-MIGraphX.
  • Applications groups a number of examples ... .
    • bitonic_sort: Showcases how to order an array of $n$ elements using a GPU implementation of the bitonic sort.
    • convolution: A simple GPU implementation for the calculation of discrete convolutions.
    • floyd_warshall: Showcases a GPU implementation of the Floyd-Warshall algorithm for finding shortest paths in certain types of graphs.
    • histogram: Histogram over a byte array with memory bank optimization.
    • monte_carlo_pi: Monte Carlo estimation of $\pi$ using hipRAND for random number generation and hipCUB for evaluation.
    • prefix_sum: Showcases a GPU implementation of a prefix sum with a 2-kernel scan algorithm.
  • Common contains common utility functionality shared between the examples.
  • HIP-Basic hosts self-contained recipes showcasing HIP runtime functionality.
    • assembly_to_executable: Program and accompanying build systems that show how to manually compile and link a HIP application from host and device code.
    • bandwidth: Program that measures memory bandwidth from host to device, device to host, and device to device.
    • bit_extract: Program that showcases how to use HIP built-in bit extract.
    • device_globals: Showcases how to set global variables on the device from the host.
    • device_query: Program that showcases how properties from the device may be queried.
    • dynamic_shared: Program that showcases how to use dynamic shared memory with the help of a simple matrix transpose kernel.
    • events: Measuring execution time and synchronizing with HIP events.
    • gpu_arch: Program that showcases how to implement GPU architecture-specific code.
    • hello_world: Simple program that showcases launching kernels and printing from the device.
    • hipify: Simple program and build definitions that showcase automatically converting a CUDA .cu source into portable HIP .hip source.
    • llvm_ir_to_executable: Shows how to create a HIP executable from LLVM IR.
    • inline_assembly: Program that showcases how to use inline assembly in a portable manner.
    • matrix_multiplication: Multiply two dynamically sized matrices utilizing shared memory.
    • module_api: Shows how to load and execute a HIP module at runtime.
    • moving_average: Simple program that demonstrates parallel computation of a moving average of one-dimensional data.
    • multi_gpu_data_transfer: Performs two matrix transposes on two different devices (one on each) to showcase how to use peer-to-peer communication among devices.
    • occupancy: Shows how to find optimal configuration parameters for a kernel launch with maximum occupancy.
    • opengl_interop: Showcases how to share resources and computation between HIP and OpenGL.
    • runtime_compilation: Simple program that showcases how to use HIP runtime compilation (hipRTC) to compile a kernel and launch it on a device.
    • saxpy: Implements the $y_i=ax_i+y_i$ kernel and explains basic HIP functionality.
    • shared_memory: Showcases how to use static shared memory by implementing a simple matrix transpose kernel.
    • static_device_library: Shows how to create a static library containing device functions, and how to link it with an executable.
    • static_host_library: Shows how to create a static library containing HIP host functions, and how to link it with an executable.
    • streams: Program that showcases usage of multiple streams each with their own tasks.
    • texture_management: Shows the usage of texture memory.
    • vulkan_interop: Showcases how to share resources and computation between HIP and Vulkan.
    • warp_shuffle: Uses a simple matrix transpose kernel to showcase how to use warp shuffle operations.
  • Dockerfiles hosts Dockerfiles with ready-to-use environments for the various samples. See Dockerfiles/README.md for details.
  • Docs
    • CONTRIBUTING.md contains information on how to contribute to the examples.
  • Libraries
    • hipBLAS
      • gemm_strided_batched: Showcases the general matrix product operation with strided and batched matrices.
      • her: Showcases a rank-2 update of a Hermitian matrix with complex values.
      • scal: Simple program that showcases vector scaling (SCAL) operation.
    • hipCUB
      • device_radix_sort: Simple program that showcases hipcub::DeviceRadixSort::SortPairs.
      • device_sum: Simple program that showcases hipcub::DeviceReduce::Sum.
    • hipSOLVER
      • gels: Solve a linear system of the form $A\times X=B$.
      • geqrf: Program that showcases how to obtain a QR decomposition with the hipSOLVER API.
      • gesvd: Program that showcases how to obtain a singular value decomposition with the hipSOLVER API.
      • getrf: Program that showcases how to perform an LU factorization with hipSOLVER.
      • potrf: Perform Cholesky factorization and solve linear system with result.
      • syevd: Program that showcases how to calculate the eigenvalues of a matrix using a divide-and-conquer algorithm in hipSOLVER.
      • syevdx: Shows how to compute a subset of the eigenvalues and the corresponding eigenvectors of a real symmetric matrix A using the Compatibility API of hipSOLVER.
      • sygvd: Showcases how to obtain a solution $(X, \Lambda)$ for a generalized symmetric-definite eigenvalue problem of the form $A \cdot X = B\cdot X \cdot \Lambda$.
      • syevj: Calculates the eigenvalues and eigenvectors from a real symmetric matrix using the Jacobi method.
      • syevj_batched: Showcases how to compute the eigenvalues and eigenvectors (via Jacobi method) of each matrix in a batch of real symmetric matrices.
      • sygvj: Calculates the generalized eigenvalues and eigenvectors from a pair of real symmetric matrices using the Jacobi method.
    • rocBLAS
      • level_1: Operations between vectors and vectors.
        • axpy: Simple program that showcases the AXPY operation.
        • dot: Simple program that showcases dot product.
        • nrm2: Simple program that showcases Euclidean norm of a vector.
        • scal: Simple program that showcases vector scaling (SCAL) operation.
        • swap: Showcases exchanging elements between two vectors.
      • level_2: Operations between vectors and matrices.
        • her: Showcases a rank-1 update of a Hermitian matrix with complex values.
        • gemv: Showcases the general matrix-vector product operation.
      • level_3: Operations between matrices and matrices.
        • gemm: Showcases the general matrix product operation.
        • gemm_strided_batched: Showcases the general matrix product operation with strided and batched matrices.
    • rocPRIM
      • block_sum: Simple program that showcases rocprim::block_reduce with an addition operator.
      • device_sum: Simple program that showcases rocprim::reduce with an addition operator.
    • rocRAND
      • simple_distributions_cpp: A command-line app to compare random number generation on the CPU and on the GPU with rocRAND.
    • rocSOLVER
      • getf2: Program that showcases how to perform an LU factorization with rocSOLVER.
      • getri: Program that showcases matrix inversion by LU decomposition using rocSOLVER.
      • syev: Shows how to compute the eigenvalues and eigenvectors from a symmetrical real matrix.
      • syev_batched: Shows how to compute the eigenvalues and eigenvectors for each matrix in a batch of real symmetric matrices.
      • syev_strided_batched: Shows how to compute the eigenvalues and eigenvectors for multiple symmetrical real matrices that are stored with an arbitrary stride.
    • rocSPARSE
      • level_2: Operations between sparse matrices and dense vectors.
        • bsrmv: Showcases a sparse matrix-vector multiplication using BSR storage format.
        • bsrxmv: Showcases a masked sparse matrix-vector multiplication using BSR storage format.
        • bsrsv: Showcases how to solve a linear system of equations whose coefficients are stored in a BSR sparse triangular matrix.
        • coomv: Showcases a sparse matrix-vector multiplication using COO storage format.
        • csrmv: Showcases a sparse matrix-vector multiplication using CSR storage format.
        • csrsv: Showcases how to solve a linear system of equations whose coefficients are stored in a CSR sparse triangular matrix.
        • ellmv: Showcases a sparse matrix-vector multiplication using ELL storage format.
        • gebsrmv: Showcases a sparse matrix-dense vector multiplication using GEBSR storage format.
      • level_3: Operations between sparse and dense matrices.
        • bsrmm: Showcases a sparse matrix-matrix multiplication using BSR storage format.
        • bsrsm: Showcases how to solve a linear system of equations whose coefficients are stored in a BSR sparse triangular matrix, with solution and right-hand side stored in dense matrices.
        • csrmm: Showcases a sparse matrix-matrix multiplication using CSR storage format.
        • csrsm: Showcases how to solve a linear system of equations whose coefficients are stored in a CSR sparse triangular matrix, with solution and right-hand side stored in dense matrices.
        • gebsrmm: Showcases a sparse matrix-matrix multiplication using GEBSR storage format.
      • preconditioner: Manipulations on sparse matrices to obtain sparse preconditioner matrices.
        • bsric0: Shows how to compute the incomplete Cholesky decomposition of a Hermitian positive-definite sparse BSR matrix.
        • bsrilu0: Showcases how to obtain the incomplete LU decomposition of a sparse BSR square matrix.
        • csric0: Shows how to compute the incomplete Cholesky decomposition of a Hermitian positive-definite sparse CSR matrix.
        • csrilu0: Showcases how to obtain the incomplete LU decomposition of a sparse CSR square matrix.
        • csritilu0: Showcases how to obtain iteratively the incomplete LU decomposition of a sparse CSR square matrix.
    • rocThrust
      • device_ptr: Simple program that showcases the usage of the thrust::device_ptr template.
      • norm: An example that computes the Euclidean norm of a thrust::device_vector.
      • reduce_sum: An example that computes the sum of a thrust::device_vector integer vector using the thrust::reduce() generalized summation and the thrust::plus operator.
      • remove_points: Simple program that demonstrates the usage of the thrust random number generation, host vector, generation, tuple, zip iterator, and conditional removal templates. It generates a number of random points in a unit square and then removes all of them outside the unit circle.
      • saxpy: Simple program that implements the SAXPY operation (y[i] = a * x[i] + y[i]) using rocThrust and showcases the usage of the vector and functor templates and of thrust::fill and thrust::transform operations.
      • vectors: Simple program that showcases the host_vector and the device_vector of rocThrust.

Prerequisites

Linux

  • CMake (at least version 3.21)
  • GNU Make (optional; a number of examples also support building via Make), available through the distribution's package manager
  • ROCm (at least version 5.x.x)
  • For example-specific prerequisites, see the example subdirectories.

Windows

  • Visual Studio 2019 or 2022 with the "Desktop Development with C++" workload
  • ROCm toolchain for Windows (No public release yet)
    • The Visual Studio ROCm extension needs to be installed to build with the solution files.
  • CMake (optional, to build with CMake; requires at least version 3.21)
  • Ninja (optional, to build with CMake)

Building the example suite

Linux

These instructions assume that the prerequisites for every example are installed on the system.

CMake

See CMake build options for an overview of build options.

  • $ git clone https://github.com/amd/rocm-examples.git
  • $ cd rocm-examples
  • $ cmake -S . -B build (on ROCm) or $ cmake -S . -B build -D GPU_RUNTIME=CUDA (on CUDA)
  • $ cmake --build build
  • $ cmake --install build --prefix install

Make

Beware that only a subset of the examples support building via Make.

  • $ git clone https://github.com/amd/rocm-examples.git
  • $ cd rocm-examples
  • $ make (on ROCm) or $ make GPU_RUNTIME=CUDA (on CUDA)

Linux with Docker

Alternatively, instead of installing the prerequisites on the system, the Dockerfiles in this repository can be used to build images that provide all required prerequisites. Note that the ROCm kernel GPU driver still needs to be installed on the host system.

The following instructions showcase building the Docker image and full example suite inside the container using CMake:

  • $ git clone https://github.com/amd/rocm-examples.git
  • $ cd rocm-examples/Dockerfiles
  • $ docker build . -t rocm-examples -f hip-libraries-rocm-ubuntu.Dockerfile (on ROCm) or $ docker build . -t rocm-examples -f hip-libraries-cuda-ubuntu.Dockerfile (on CUDA)
  • $ docker run -it --device /dev/kfd --device /dev/dri rocm-examples bash (on ROCm) or $ docker run -it --gpus=all rocm-examples bash (on CUDA)
  • # git clone https://github.com/amd/rocm-examples.git
  • # cd rocm-examples
  • # cmake -S . -B build (on ROCm) or # cmake -S . -B build -D GPU_RUNTIME=CUDA (on CUDA)
  • # cmake --build build

The built executables can be found and run in the build directory:

  • # ./build/Libraries/rocRAND/simple_distributions_cpp/simple_distributions_cpp

Windows

Visual Studio

The repository has Visual Studio project files for all examples and individually for each example.

  • Project files for Visual Studio are named after the example, with a _vs<Visual Studio Version> suffix added, e.g. device_sum_vs2019.sln for the device sum example.
  • The project files can be built from Visual Studio or from the command line using MSBuild.
    • Use the build solution command in Visual Studio to build.
    • To build from the command line, execute "C:\Program Files (x86)\Microsoft Visual Studio\<Visual Studio Version>\<Edition>\MSBuild\Current\Bin\MSBuild.exe" <path to project folder>. The quotes are needed because the path contains spaces.
      • To build in Release mode pass the /p:Configuration=Release option to MSBuild.
      • The executables will be created in a subfolder named "Debug" or "Release" inside the project folder.
  • The HIP-specific project settings, such as the targeted GPU architectures, can be set on the General [AMD HIP C++] tab of the project properties.
  • The top-level solution files come in two flavors: ROCm-Examples-VS<Visual Studio Version>.sln and ROCm-Examples-Portable-VS<Visual Studio Version>.sln. The former contains all examples, while the latter contains the examples that support both ROCm and CUDA.

CMake

First, clone the repository and go to the source directory.

git clone https://github.com/amd/rocm-examples.git
cd rocm-examples

There are two ways to build the project using CMake: with the Visual Studio Developer Command Prompt (recommended) or with a standard Command Prompt. See CMake build options for an overview of build options.

Visual Studio Developer Command Prompt

Select Start, search for "x64 Native Tools Command Prompt for VS 2019", and open the resulting Command Prompt. Ninja must be selected as the generator, and Clang as the C++ compiler.

cmake -S . -B build -G Ninja -D CMAKE_CXX_COMPILER=clang
cmake --build build

Standard Command Prompt

Run the standard Command Prompt. When using the standard Command Prompt to build the project, the Resource Compiler (RC) path must be specified. The RC is a tool used to build Windows-based applications; its default path is C:/Program Files (x86)/Windows Kits/10/bin/<Windows version>/x64/rc.exe. Finally, the generator must be set to Ninja.

cmake -S . -B build -G Ninja -D CMAKE_RC_COMPILER="<path to rc compiler>"
cmake --build build

CMake build options

The following options are available when building with CMake.

| Option | Relevant to | Default value | Description |
|---|---|---|---|
| GPU_RUNTIME | HIP / CUDA | "HIP" | GPU runtime to compile for. Set to "CUDA" to compile for NVIDIA GPUs and to "HIP" for AMD GPUs. |
| CMAKE_HIP_ARCHITECTURES | HIP | Compiler default | HIP device architectures to target, e.g. "gfx908;gfx1030" to target architectures gfx908 and gfx1030. |
| CMAKE_CUDA_ARCHITECTURES | CUDA | Compiler default | CUDA architectures to compile for, e.g. "50;72" to target compute capability 50 and 72. |
