Composable Kernel: Performance Portable Programming Model for Machine Learning Tensor Operators

Composable Kernel

Methodology

The Composable Kernel (CK) library aims to provide a programming model for writing performance-critical kernels for machine learning workloads across multiple architectures, including GPUs and CPUs, through general-purpose kernel languages such as HIP C++.

CK utilizes two concepts to achieve performance portability and code maintainability:

  • A tile-based programming model
  • Algorithm complexity reduction for complex ML operators, using an innovative technique we call "Tensor Coordinate Transformation".


Code Structure

The current CK library is structured into four layers:

  • "Templated Tile Operators" layer
  • "Templated Kernel and Invoker" layer
  • "Instantiated Kernel and Invoker" layer
  • "Client API" layer


Documentation

Run the steps below to build documentation locally.

cd docs
pip3 install -r sphinx/requirements.txt
python3 -m sphinx -T -E -b html -d _build/doctrees -D language=en . _build/html

Contributors

The list of developers and contributors is here: Contributors

Citation

If you use CK, please use the following citation:

License

CK is released under the MIT license. See the License File.

Build CK

Build docker image

DOCKER_BUILDKIT=1 docker build -t ck:latest -f Dockerfile .

Pre-built dockers are available from this public repo: https://hub.docker.com/r/rocm/composable_kernel/tags

Launch docker

docker run                                     \
-it                                            \
--privileged                                   \
--group-add sudo                               \
-w /root/workspace                             \
-v ${PATH_TO_LOCAL_WORKSPACE}:/root/workspace  \
ck:latest                                      \
/bin/bash

Build CK

mkdir build && cd build

# Need to specify target ID, example below is for gfx908 and gfx90a

cmake                                                                                             \
-D CMAKE_PREFIX_PATH=/opt/rocm                                                                    \
-D CMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc                                                         \
-D CMAKE_BUILD_TYPE=Release                                                                       \
-D GPU_TARGETS="gfx908;gfx90a"                                                                    \
..

If GPU_TARGETS is not set on the cmake command line, CK will be built for all targets supported by the current compiler.

Additional cmake flags can be used to significantly speed up the build:

INSTANCES_ONLY (OFF by default) must be set to ON in order to build only the instances and library, skipping all tests, examples, and the profiler. This is useful for libraries that use CK as a dependency.

DTYPES (not set by default) can be set to any subset of "fp64;fp32;fp16;fp8;bf16;int8" to build instances of selected data types only. Currently, building the int8 instances takes a long time (a compiler fix is in the works).

DL_KERNELS (OFF by default) must be set to ON in order to build the gemm_dl and batched_gemm_multi_d_dl instances. Those instances are only needed on NAVI2x platforms.
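
For instance, a dependency build that only needs fp16 and bf16 instances for a single NAVI2x target might combine the flags above as follows (the gfx1030 target is an illustrative assumption; substitute your own target ID):

```shell
cmake                                          \
-D CMAKE_PREFIX_PATH=/opt/rocm                 \
-D CMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc      \
-D CMAKE_BUILD_TYPE=Release                    \
-D GPU_TARGETS="gfx1030"                       \
-D INSTANCES_ONLY=ON                           \
-D DTYPES="fp16;bf16"                          \
-D DL_KERNELS=ON                               \
..
```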

Build examples and tests

 make -j examples tests
 make test

Instructions for running each individual example are under example

Build ckProfiler

 make -j ckProfiler

Instructions for running ckProfiler are under profiler

Install CK

make install

Using CK as pre-built kernel library

Instructions for using CK as a pre-built kernel library are under client_example

Contributing

When you contribute to Composable Kernel, make sure to run clang-format on all the changed files. We highly recommend using git hooks that are managed by the pre-commit framework. To install hooks, run:

sudo script/install_precommit.sh

This way, pre-commit will add the appropriate hooks to your local repository and automatically run clang-format (and possibly additional checks) before any commit is created.

If you need to uninstall hooks from the repository, you can do so by running the following command:

script/uninstall_precommit.sh

If, for any reason, you need to temporarily disable pre-commit hooks, you can add the --no-verify option to the git commit command.

Caveat

Kernel Timing and Verification

CK's own kernel timer warms up the kernel once, then runs it multiple times to obtain the average kernel time. For kernels that use atomic add, this causes the output buffer to be accumulated multiple times, which leads to verification failure. To work around this, do not use CK's own timer and verification at the same time. CK's timer and verification can each be enabled or disabled from the command line in every example and in ckProfiler.
