• Stars: 283
• Rank: 145,119 (Top 3%)
• Language: C++
• License: MIT License
• Created over 2 years ago
• Updated about 1 month ago

Omnitrace: Application Profiling, Tracing, and Analysis

Omnitrace is an AMD open source research project and is not supported as part of the ROCm software stack.

Overview

AMD Research is seeking to improve observability and performance analysis for software running on AMD heterogeneous systems. If you are familiar with rocprof and/or uProf, you will find many of the capabilities of these tools available via Omnitrace in addition to many new capabilities.

Omnitrace is a comprehensive profiling and tracing tool for parallel applications written in C, C++, Fortran, HIP, OpenCL, and Python which execute on the CPU or CPU+GPU. It is capable of gathering the performance information of functions through any combination of binary instrumentation, call-stack sampling, user-defined regions, and Python interpreter hooks. Omnitrace supports interactive visualization of comprehensive traces in the web browser, in addition to high-level summary profiles with mean/min/max/stddev statistics. Beyond runtime data, omnitrace supports the collection of system-level metrics (e.g. CPU frequency, GPU temperature, GPU utilization), process-level metrics (e.g. memory usage, page faults, context switches), and thread-level metrics (e.g. memory usage, CPU time, numerous hardware counters).

Data Collection Modes

  • Dynamic instrumentation
    • Runtime instrumentation
      • Instrument executable and shared libraries at runtime
    • Binary rewriting
      • Generate a new executable and/or library with instrumentation built-in
  • Statistical sampling
    • Periodic software interrupts per-thread
  • Process-level sampling
    • Background thread records process-, system- and device-level metrics while the application executes
  • Causal profiling
    • Quantifies the potential impact of optimizations in parallel codes
  • Critical trace generation
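Statistical sampling works by periodically interrupting each thread and recording the interrupted call stack. The idea can be sketched in pure Python with an interval timer (illustrative only; omnitrace's sampler is implemented in C++ with per-thread timers and hardware-counter support):

```python
import signal
import time
from collections import Counter

samples = Counter()

def on_sample(signum, frame):
    # Record which function the timer interrupted.
    samples[frame.f_code.co_name] += 1

def busy_work(seconds):
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        pass

signal.signal(signal.SIGPROF, on_sample)
# Deliver SIGPROF every 10 ms of consumed CPU time (i.e. a 100 Hz sampler).
signal.setitimer(signal.ITIMER_PROF, 0.01, 0.01)
busy_work(0.3)
signal.setitimer(signal.ITIMER_PROF, 0, 0)  # stop the sampler

# busy_work consumed nearly all the CPU time, so it dominates the samples.
print(samples.most_common(1)[0][0])  # -> busy_work
```

Functions that consume more CPU time are interrupted more often, so the sample counts approximate the time profile without instrumenting any code.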

Data Analysis

  • High-level summary profiles with mean/min/max/stddev statistics
    • Low overhead, memory efficient
    • Ideal for running at scale
  • Comprehensive traces
    • Every individual event/measurement
  • Application speedup predictions resulting from potential optimizations in functions and lines of code (causal profiling)
  • Critical trace analysis (alpha)
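Causal profiling answers the question "how much would the whole program speed up if this region were faster?" As a rough intuition for the numbers it reports, here is a closed-form sketch (assuming a serial, Amdahl-style model; omnitrace's causal mode actually runs virtual-speedup experiments at run time rather than using this formula):

```python
def predicted_speedup(f, s):
    """Amdahl's law: overall speedup when a fraction f of the
    runtime is accelerated by a factor s."""
    return 1.0 / ((1.0 - f) + f / s)

# Doubling the speed of a region that is 40% of the runtime
# yields only a 1.25x whole-program speedup.
print(round(predicted_speedup(0.4, 2.0), 3))  # -> 1.25
```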

Parallelism API Support

  • HIP
  • HSA
  • Pthreads
  • MPI
  • Kokkos-Tools (KokkosP)
  • OpenMP-Tools (OMPT)

GPU Metrics

  • GPU hardware counters
  • HIP API tracing
  • HIP kernel tracing
  • HSA API tracing
  • HSA operation tracing
  • System-level sampling (via rocm-smi)
    • Memory usage
    • Power usage
    • Temperature
    • Utilization

CPU Metrics

  • CPU hardware counters sampling and profiles
  • CPU frequency sampling
  • Various timing metrics
    • Wall time
    • CPU time (process and/or thread)
    • CPU utilization (process and/or thread)
    • User CPU time
    • Kernel CPU time
  • Various memory metrics
    • High-water mark (sampling and profiles)
    • Memory page allocation
    • Virtual memory usage
  • Network statistics
  • I/O metrics
  • ... many more

Documentation

The full documentation for omnitrace is available at amdresearch.github.io/omnitrace. See the Getting Started documentation for general tips and a detailed discussion about sampling vs. binary instrumentation.

Quick Start

Installation

  • Visit Releases page
  • Select the appropriate installer (recommendation: the .sh scripts, unlike the DEB/RPM installers, do not require super-user privileges)
    • If targeting a ROCm application, find the installer script with the matching ROCm version
    • If you are unsure about your Linux distro, check /etc/os-release or use the omnitrace-install.py script

If the above recommendation is not desired, download the omnitrace-install.py and specify --prefix <install-directory> when executing it. This script will attempt to auto-detect a compatible OS distribution and version. If ROCm support is desired, specify --rocm X.Y where X is the ROCm major version and Y is the ROCm minor version, e.g. --rocm 5.4.

wget https://github.com/AMDResearch/omnitrace/releases/latest/download/omnitrace-install.py
python3 ./omnitrace-install.py --prefix /opt/omnitrace/rocm-5.4 --rocm 5.4

See the Installation Documentation for detailed information.

Setup

NOTE: Replace /opt/omnitrace below with installation prefix as necessary.

  • Option 1: Source setup-env.sh script
source /opt/omnitrace/share/omnitrace/setup-env.sh
  • Option 2: Load modulefile
module use /opt/omnitrace/share/modulefiles
module load omnitrace
  • Option 3: Manual
export PATH=/opt/omnitrace/bin:${PATH}
export LD_LIBRARY_PATH=/opt/omnitrace/lib:${LD_LIBRARY_PATH}

Omnitrace Settings

Generate an omnitrace configuration file using omnitrace-avail -G omnitrace.cfg. Optionally, use omnitrace-avail -G omnitrace.cfg --all for a verbose configuration file with descriptions, categories, etc. Modify the configuration file as desired, e.g. enable perfetto, timemory, sampling, and process-level sampling by default and tweak some sampling default values:

# ...
OMNITRACE_USE_PERFETTO         = true
OMNITRACE_USE_TIMEMORY         = true
OMNITRACE_USE_SAMPLING         = true
OMNITRACE_USE_PROCESS_SAMPLING = true
# ...
OMNITRACE_SAMPLING_FREQ        = 50
OMNITRACE_SAMPLING_CPUS        = all
OMNITRACE_SAMPLING_GPUS        = $env:HIP_VISIBLE_DEVICES

Once the configuration file is adjusted to your preferences, either export the path to this file via OMNITRACE_CONFIG_FILE=/path/to/omnitrace.cfg or place this file in ${HOME}/.omnitrace.cfg to ensure these values are always read as the default. If you wish to change any of these settings, you can override them via environment variables or by specifying an alternative OMNITRACE_CONFIG_FILE.
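The precedence described above (environment variable over config-file value) can be sketched in a few lines of Python (illustrative; the real parsing happens inside the omnitrace runtime, and `load_cfg`/`setting` are hypothetical helpers):

```python
import os

def load_cfg(text):
    """Parse 'KEY = value' lines, skipping comments and blanks."""
    cfg = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if "=" in line:
            key, _, value = line.partition("=")
            cfg[key.strip()] = value.strip()
    return cfg

def setting(name, cfg, default=None):
    # An exported environment variable overrides the config-file value.
    return os.environ.get(name, cfg.get(name, default))

cfg = load_cfg("OMNITRACE_SAMPLING_FREQ = 50\nOMNITRACE_USE_PERFETTO = true")
os.environ["OMNITRACE_SAMPLING_FREQ"] = "300"
print(setting("OMNITRACE_SAMPLING_FREQ", cfg))  # -> 300 (env overrides file)
print(setting("OMNITRACE_USE_PERFETTO", cfg))   # -> true (from file)
```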

Call-Stack Sampling

The omnitrace-sample executable is used to execute call-stack sampling on a target application without binary instrumentation. Use a double-hyphen (--) to separate the command-line arguments for omnitrace-sample from the target application and its arguments.

omnitrace-sample --help
omnitrace-sample <omnitrace-options> -- <exe> <exe-options>
omnitrace-sample -f 1000 -- ls -la

Binary Instrumentation

The omnitrace-instrument executable is used to instrument an existing binary. Call-stack sampling can be enabled alongside the execution of an instrumented binary, to help "fill in the gaps" between instrumented regions, by setting the OMNITRACE_USE_SAMPLING configuration variable to ON. Similar to omnitrace-sample, use a double-hyphen (--) to separate the command-line arguments for omnitrace-instrument from the target application and its arguments.

omnitrace-instrument --help
omnitrace-instrument <omnitrace-options> -- <exe-or-library> <exe-options>

Binary Rewrite

Rewrite the text section of an executable or library with instrumentation:

omnitrace-instrument -o app.inst -- /path/to/app

In binary rewrite mode, if you also want instrumentation in the linked libraries, you must rewrite those libraries as well. For example, to instrument the functions starting with "hip" in the amdhip64 library:

mkdir -p ./lib
omnitrace-instrument -R '^hip' -o ./lib/libamdhip64.so.4 -- /opt/rocm/lib/libamdhip64.so.4
export LD_LIBRARY_PATH=${PWD}/lib:${LD_LIBRARY_PATH}

Verify via ldd that your executable will load the instrumented library -- if you built your executable with an RPATH to the original library's directory, then prefixing LD_LIBRARY_PATH will have no effect.

Once you have rewritten your executable and/or libraries with instrumentation, you can simply run the instrumented executable, or the executable which loads the instrumented libraries, normally, e.g.:

omnitrace-run -- ./app.inst

If you want to re-define certain settings to a new default in a binary rewrite, use the --env option. This omnitrace option will set the environment variable to the given value but will not override an existing value. E.g. the default value of OMNITRACE_PERFETTO_BUFFER_SIZE_KB is 1024000 KB (1 GiB):

# buffer size defaults to 1024000
omnitrace-instrument -o app.inst -- /path/to/app
omnitrace-run -- ./app.inst

Passing --env OMNITRACE_PERFETTO_BUFFER_SIZE_KB=5120000 will change the default value in app.inst to 5120000 KB (5 GiB):

# defaults to 5 GiB buffer size
omnitrace-instrument -o app.inst --env OMNITRACE_PERFETTO_BUFFER_SIZE_KB=5120000 -- /path/to/app
omnitrace-run -- ./app.inst
# override default 5 GiB buffer size to 200 MB via command-line
omnitrace-run --trace-buffer-size=200000 -- ./app.inst
# override default 5 GiB buffer size to 200 MB via environment
export OMNITRACE_PERFETTO_BUFFER_SIZE_KB=200000
omnitrace-run -- ./app.inst
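The "set but do not override" behavior of --env can be sketched with Python's os.environ.setdefault (illustrative; `apply_baked_env` is a hypothetical helper, not omnitrace code):

```python
import os

def apply_baked_env(baked):
    # setdefault only assigns when the variable is not already set, mirroring
    # how --env changes the default without overriding a user's environment.
    for key, value in baked.items():
        os.environ.setdefault(key, value)

baked = {"OMNITRACE_PERFETTO_BUFFER_SIZE_KB": "5120000"}

apply_baked_env(baked)
print(os.environ["OMNITRACE_PERFETTO_BUFFER_SIZE_KB"])  # -> 5120000 (baked default)

os.environ["OMNITRACE_PERFETTO_BUFFER_SIZE_KB"] = "200000"
apply_baked_env(baked)
print(os.environ["OMNITRACE_PERFETTO_BUFFER_SIZE_KB"])  # -> 200000 (user override kept)
```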

Runtime Instrumentation

Runtime instrumentation will not only instrument the text section of the executable but also the text sections of the linked libraries. Thus, it may be useful to exclude those libraries via the -ME (module exclude) regex option or exclude specific functions with the -E regex option.

omnitrace-instrument -- /path/to/app
omnitrace-instrument -ME '^(libhsa-runtime64|libz\.so)' -- /path/to/app
omnitrace-instrument -E 'rocr::atomic|rocr::core|rocr::HSA' -- /path/to/app
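The effect of the -ME and -E exclude regexes can be sketched as a filter over candidate (module, function) pairs (illustrative; the actual matching is performed by omnitrace during instrumentation, and `select` is a hypothetical helper):

```python
import re

def select(functions, module_exclude=None, function_exclude=None):
    """Keep (module, function) pairs not matched by either exclude regex."""
    kept = []
    for module, func in functions:
        if module_exclude and re.search(module_exclude, module):
            continue
        if function_exclude and re.search(function_exclude, func):
            continue
        kept.append((module, func))
    return kept

candidates = [
    ("myapp", "compute"),
    ("libhsa-runtime64.so", "hsa_init"),
    ("libz.so", "inflate"),
    ("myapp", "rocr::atomic::load"),
]
# -ME '^(libhsa-runtime64|libz\.so)' drops the two runtime libraries:
print(select(candidates, module_exclude=r"^(libhsa-runtime64|libz\.so)"))
# -E 'rocr::atomic|rocr::core|rocr::HSA' drops matching functions everywhere:
print(select(candidates, function_exclude=r"rocr::atomic|rocr::core|rocr::HSA"))
```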

Python Profiling and Tracing

Use the omnitrace-python script to profile/trace Python interpreter function calls. Use a double-hyphen (--) to separate the command-line arguments for omnitrace-python from the target script and its arguments.

omnitrace-python --help
omnitrace-python <omnitrace-options> -- <python-script> <script-args>
omnitrace-python -- ./script.py

Please note, the first argument after the double-hyphen must be a Python script, e.g. omnitrace-python -- ./script.py.

If you need to specify a specific python interpreter version, use omnitrace-python-X.Y where X.Y is the Python major and minor version:

omnitrace-python-3.8 -- ./script.py

If you need to specify the full path to a Python interpreter, set the PYTHON_EXECUTABLE environment variable:

PYTHON_EXECUTABLE=/opt/conda/bin/python omnitrace-python -- ./script.py

If you want to restrict the data collection to specific function(s) and their callees, pass the -b / --builtin option after decorating the function(s) with @profile. Use the @noprofile decorator to exclude/ignore function(s) and their callees:

def foo():
    pass

@noprofile
def bar():
    foo()

@profile
def spam():
    foo()
    bar()

Each time spam is called during profiling, the profiling results will include 1 entry for spam and 1 entry for foo via the direct call within spam. There will be no entries for bar or the foo invocation within it.
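The selection semantics above can be emulated in pure Python (illustrative only; omnitrace's @profile/@noprofile hook the interpreter's profiling callbacks rather than wrapping functions like this sketch does):

```python
import functools

collected = []   # names recorded by the "profiler"
_active = 0      # depth of @profile scopes
_suppress = 0    # depth of @noprofile scopes

def profile(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        global _active
        _active += 1
        try:
            if not _suppress:
                collected.append(fn.__name__)
            return fn(*args, **kwargs)
        finally:
            _active -= 1
    return wrapper

def noprofile(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        global _suppress
        _suppress += 1
        try:
            return fn(*args, **kwargs)
        finally:
            _suppress -= 1
    return wrapper

def record(name):
    # Stand-in for the interpreter hook on an undecorated function: it is
    # recorded only inside a @profile scope and outside any @noprofile scope.
    if _active and not _suppress:
        collected.append(name)

def foo():
    record("foo")

@noprofile
def bar():
    foo()

@profile
def spam():
    foo()
    bar()

spam()
print(collected)  # -> ['spam', 'foo']: bar and its call to foo are excluded
```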

Trace Visualization

  • Visit ui.perfetto.dev in the web browser
  • Select "Open trace file" from panel on the left
  • Locate the omnitrace perfetto output (extension: .proto)

(Example screenshots: omnitrace-perfetto, omnitrace-rocm, omnitrace-rocm-flow, omnitrace-user-api)

Using Perfetto tracing with System Backend

Perfetto tracing with the system backend supports multiple processes writing to the same output file. Thus, it is a useful technique if Omnitrace is built with partial MPI support because all the perfetto output will be coalesced into a single file. The installation docs for perfetto can be found here. If you are building omnitrace from source, you can configure CMake with OMNITRACE_INSTALL_PERFETTO_TOOLS=ON and the perfetto and traced applications will be installed as part of the build process. However, it should be noted that to prevent this option from accidentally overwriting an existing perfetto install, all the perfetto executables installed by omnitrace are prefixed with omnitrace-perfetto-, except for the perfetto executable, which is just renamed omnitrace-perfetto.

Enable traced and perfetto in the background:

pkill traced
traced --background
perfetto --out ./omnitrace-perfetto.proto --txt -c ${OMNITRACE_ROOT}/share/omnitrace.cfg --background

NOTE: if the perfetto tools were installed by omnitrace, replace traced with omnitrace-perfetto-traced and perfetto with omnitrace-perfetto.

Configure omnitrace to use the perfetto system backend via the --perfetto-backend option of omnitrace-run:

# enable sampling on the uninstrumented binary
omnitrace-run --sample --trace --perfetto-backend=system -- ./myapp
# trace the instrumented binary
omnitrace-instrument -o ./myapp.inst -- ./myapp
omnitrace-run --trace --perfetto-backend=system -- ./myapp.inst

or via the --env option of omnitrace-instrument + runtime instrumentation:

omnitrace-instrument --env OMNITRACE_PERFETTO_BACKEND=system -- ./myapp
