  • Stars: 125
  • Rank: 286,335 (Top 6%)
  • Language: Cuda
  • License: MIT License
  • Created: over 4 years ago
  • Updated: over 4 years ago


Repository Details

Instructions, Docker images, and examples for Nsight Compute and Nsight Systems

Nvidia Performance Tools


Docker images with Nvidia's Performance Tools

cwpearson/nvidia-performance-tools on Docker Hub.

docker pull cwpearson/nvidia-performance-tools:latest-amd64   # for x86
docker pull cwpearson/nvidia-performance-tools:latest-ppc64le # for POWER

Typically, you'll want the latest-amd64 or latest-ppc64le tags. If you are developing a workflow and want stability, choose a tag like amd64-10.1-master-ce03360, which describes the architecture, CUDA version, branch, and short SHA of the corresponding git commit for cwpearson/nvidia-performance-tools on Github.
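For example, pinning to the fully qualified tag mentioned above looks like:

```shell
# Pin to an exact image: architecture, CUDA version, branch, and short commit SHA
docker pull cwpearson/nvidia-performance-tools:amd64-10.1-master-ce03360
```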

Presentations

Examples

  • sgemm: basic, shared-memory tiled, and joint shared-memory and register tiling implementations.
  • coalescing: a simple code with and without memory coalescing, and a discussion of how to analyze access efficiency in Nsight Compute.

Installing Nsight Systems and Nsight Compute

There is a command-line (CLI) and graphical (GUI) version of each tool. They will be installed together, unless a CLI-only version is downloaded.

  • macOS: You probably don't have CUDA installed, so download the Nsight Systems or Compute installer from the Nvidia website.
  • Windows:
    • with CUDA: You may already find Nsight Systems or Compute in your start menu. You can download a more recent release from the Nvidia website. If you install it, you will have two entries in the start menu for different versions.
    • without CUDA: Download the Nsight Systems or Compute installer from the CUDA website.
  • Linux
    • with CUDA: you may already have Nsight Systems and Compute (check /usr/local/cuda/bin/nsight-sys and /usr/local/cuda/bin/nv-nsight-cu). If so, you can still download the Nsight Systems or Compute .deb package to update. It may override the package that was installed with CUDA. You can also use the .run file, which you should install to a directory not managed by the package manager, and add the location of the resulting binary files to your path.
    • without CUDA:
      • .deb: Download the .deb package and install it. Requires root privileges
      • .run: Download the .run package and execute it. Choose a file system that you have permission to install to, and then add the resulting binary directory to your path.

Preparing for Profiling

Source code annotations

#include <nvToolsExt.h>

nvtxRangePush("span 1");
nvtxRangePush("a nested span");
nvtxRangePop(); // end nested span
nvtxRangePop(); // end span 1

Also link with -lnvToolsExt.
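As a sketch, the ranges above might wrap a kernel launch like this (the kernel and sizes here are hypothetical, not from this repository):

```cuda
#include <nvToolsExt.h>

__global__ void kernel_name(float *x) { // hypothetical kernel
  x[threadIdx.x] *= 2.0f;
}

int main() {
  float *x;
  cudaMalloc(&x, 32 * sizeof(float));

  nvtxRangePush("span 1");        // outer span, visible in the profiler timeline
  nvtxRangePush("a nested span"); // nested span around the launch
  kernel_name<<<1, 32>>>(x);
  cudaDeviceSynchronize();
  nvtxRangePop();                 // end nested span
  nvtxRangePop();                 // end span 1

  cudaFree(x);
  return 0;
}
```

The spans then appear as labeled regions on the timeline in Nsight Systems.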

nvcc

Compile with optimizations turned on and without debug information. The most likely relevant nvcc flags are below:

--profile                                       (-pg)                           
        Instrument generated code/executable for use by gprof (Linux only).

--debug                                         (-g)                            
        Generate debug information for host code.

--device-debug                                  (-G)                            
        Generate debug information for device code. Turns off all optimizations.
        Don't use for profiling; use -lineinfo instead.

--generate-line-info                            (-lineinfo)                     
        Generate line-number information for device code.

So, change nvcc -g/-pg/-G ... to nvcc <your optimization flags> -lineinfo ....
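For example, a profiling-friendly build might look like the following (the source file name is hypothetical; -lnvToolsExt is needed only if you use NVTX annotations):

```shell
# optimized host and device code, with device line info for Nsight Compute
nvcc -O3 -lineinfo main.cu -o a.out -lnvToolsExt
```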

cuda-memcheck

If your code writes to unallocated memory, it may corrupt the profiling process. If profiling fails, try running your code under cuda-memcheck, which instruments your binary to detect bad GPU memory activity. Fix any errors it reports, then try profiling again. Instrumentation usually causes a ~100x slowdown, so try a small dataset first.

cuda-memcheck my-binary

Nsight Systems Environment Check

Run nsys status -e. You should see something like

Sampling Environment Check
Linux Kernel Paranoid Level = 2: OK
Linux Distribution = Ubuntu
Linux Kernel Version = 4.16.15-41615: OK
Linux perf_event_open syscall available: OK
Sampling trigger event available: OK
Intel(c) Last Branch Record support: Available
Sampling Environment: OK

Errors may reduce the amount of information collected, or cause profiling to fail. Consult documentation for troubleshooting steps.

Capturing a Profile with CLI

Under this scheme, we

  • use the CLI on the target to record a profiling file
  • transfer that file to the client
  • use the GUI on the client to analyze the record

Nsight Compute

This command will

  • Generate a report file a.nsight-cuprof-report with the recorded profiling information
  • Measure metrics associated with all sections
  • Profile the 6th invocation of __global__ void kernel_name(...)
  • Run a.out

nv-nsight-cu-cli \
  -o a \
  --sections ".*" \
  --kernel-id ::kernel_name:6 \
  a.out

To see sections that will be recorded for a command, add --list-sections.

nv-nsight-cu-cli --list-sections
----------------------------- ------------------------------- --------------------------------------------------
Identifier                    Display Name                    Filename                                          
----------------------------- ------------------------------- --------------------------------------------------
ComputeWorkloadAnalysis       Compute Workload Analysis       .../../../sections/ComputeWorkloadAnalysis.section
InstructionStats              Instruction Statistics          ...64/../../sections/InstructionStatistics.section
LaunchStats                   Launch Statistics               ...1_3-x64/../../sections/LaunchStatistics.section
MemoryWorkloadAnalysis        Memory Workload Analysis        ...4/../../sections/MemoryWorkloadAnalysis.section
MemoryWorkloadAnalysis_Chart  Memory Workload Analysis Chart  ..../sections/MemoryWorkloadAnalysis_Chart.section
MemoryWorkloadAnalysis_Tables Memory Workload Analysis Tables .../sections/MemoryWorkloadAnalysis_Tables.section
Occupancy                     Occupancy                       ...ibc_2_11_3-x64/../../sections/Occupancy.section
SchedulerStats                Scheduler Statistics            ...-x64/../../sections/SchedulerStatistics.section
SourceCounters                Source Counters                 ..._11_3-x64/../../sections/SourceCounters.section
SpeedOfLight                  GPU Speed Of Light              ..._2_11_3-x64/../../sections/SpeedOfLight.section
WarpStateStats                Warp State Statistics           ...-x64/../../sections/WarpStateStatistics.section

To see supported metrics on a device, do nv-nsight-cu-cli --devices 0 --query-metrics

On some newer devices, the base metric names will not work; you must append an allowed suffix. To see all legal names and suffixes, do nv-nsight-cu-cli --devices 0 --query-metrics --query-metrics-mode all

The --kernel-id flag takes a string like context-id:stream-id:[name-operator:]kernel-name:invocation-nr. Commonly, we might only use kernel-name, to select kernels to profile by name, and invocation-nr, to select which invocation of the kernels to profile.
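For example, to profile only the third invocation of a kernel selected by name on any context and stream (the kernel and output names here are hypothetical), something like:

```shell
nv-nsight-cu-cli \
  --kernel-id ::mykernel:3 \
  -o mykernel-3 \
  a.out
```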

Nsight Systems

This command will

  • Record profiling info to a.qdrep
  • Run a.out

nsys profile \
  -o a \
  a.out

Using the GUI on a client to view a recorded file from the target

In Nsight Compute:

File > Open File ... > file.nsight-cuprof-report

If you profiled on a different system than the one the GUI tool is running on, and you want to look at a View that includes the source, you may have to click the "resolve" button and navigate to a local copy of the source file.

In Nsight Systems:

File > Open > file.qdrep

Using the GUI on the client to Control Remote Profiling on the target

instructions to come

Managing docker images

  • docker ps -a (list all containers)
  • docker rm $(docker ps -a -q) (remove all containers)
  • docker system prune (remove unused images and data)

Run a profiling container:

docker run cwpearson/nvidia-performance-tools:latest-amd64
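GPU profiling inside a container typically needs GPU access and permission to read performance counters; a sketch, assuming the NVIDIA container toolkit is installed (exact flags depend on your Docker and driver setup):

```shell
# --gpus all: expose the host GPUs to the container
# --cap-add=SYS_ADMIN: many GPU performance counters require this privilege
docker run --gpus all --cap-add=SYS_ADMIN -it \
  cwpearson/nvidia-performance-tools:latest-amd64
```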

Resume a previously exited container:

docker ps -a       # find the ID
docker start <ID>  # resume the exited container
docker attach <ID> # attach a terminal to the container

For Contributors

See CONTRIBUTING.md

Resources
