Nvidia Performance Tools

Docker images with Nvidia's Performance Tools

The images are published as cwpearson/nvidia-performance-tools on Docker Hub.

docker pull cwpearson/nvidia-performance-tools:latest-amd64   # for x86
docker pull cwpearson/nvidia-performance-tools:latest-ppc64le # for POWER

Typically, you'll want the latest-amd64 or latest-ppc64le tags. If you are developing a workflow and want stability, pin a tag like amd64-10.1-master-ce03360, which encodes the architecture, CUDA version, branch, and short SHA of the corresponding git commit for cwpearson/nvidia-performance-tools on GitHub.
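
As a rough illustration of the tag scheme, the sketch below picks a latest-* tag from the output of uname -m and splits a pinned tag into its parts. The function names latest_tag_for and parse_tag are made up for this example and are not part of the repository; the parsing assumes the hyphen-separated layout described above (architecture, CUDA version, branch, short SHA).

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.

# Map a machine name (as printed by `uname -m`) to one of the two latest-* tags.
latest_tag_for() {
  case $1 in
    x86_64)  echo latest-amd64 ;;
    ppc64le) echo latest-ppc64le ;;
    *)       echo "no image tag known for $1" >&2; return 1 ;;
  esac
}

# Split <arch>-<cuda>-<branch>-<sha>. The branch may itself contain hyphens,
# so the SHA is taken from the end and the branch is whatever remains.
parse_tag() {
  tag=$1
  arch=${tag%%-*}; rest=${tag#*-}
  cuda=${rest%%-*}; rest=${rest#*-}
  sha=${rest##*-}
  branch=${rest%-*}
  printf 'arch=%s cuda=%s branch=%s sha=%s\n' "$arch" "$cuda" "$branch" "$sha"
}

latest_tag_for "$(uname -m)" || true
parse_tag amd64-10.1-master-ce03360
```

The first helper's output could be appended to the image name in a docker pull command, for example docker pull cwpearson/nvidia-performance-tools:$(latest_tag_for "$(uname -m)").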

Presentations

April 21-23 2020 University of Illinois ECE 498 guest lecture for Professors Hwu, Chen, and Xiong.
- Slides
- Recorded Lectures
  - Part 1: Intro, CUDA Events, Nsight Compute
  - Part 2: Nsight Systems
April 16 2020 University of Illinois ECE 408 guest lecture for Professor Lumetta.
- Slides
- Recorded Lecture (75 mins)

Examples

sgemm: basic, shared-memory tiled, and joint shared-memory and register tiled implementations.
coalescing: a simple code with and without memory coalescing, and a discussion of how to analyze efficiency in Nsight Compute.

Installing Nsight Systems and Nsight Compute

Each tool has a command-line (CLI) and a graphical (GUI) version. They are installed together unless you download a CLI-only version.

macOS: You probably don't have CUDA installed, so download the Nsight Systems or Compute installer from the Nvidia website.
Windows:
- with CUDA: You may already find Nsight Systems or Compute in your start menu. You can download a more recent release from the Nvidia website. If you install it, you will have two entries in the start menu for different versions.
- without CUDA: Download the Nsight Systems or Compute installer from the Nvidia website.
Linux:
- with CUDA: you may already have Nsight Systems and Compute (check /usr/local/cuda/bin/nsight-sys and /usr/local/cuda/bin/nv-nsight-cu). If so, you can still download the Nsight Systems or Compute .deb package to update. It may override the package that was installed with CUDA. You can also use the .run file, which you should install to a directory not managed by the package manager, and add the location of the resulting binary files to your path.
- without CUDA:
  - .deb: Download the .deb package and install it. Requires root privileges
  - .run: Download the .run package and execute it. Choose a file system that you have permission to install to, and then add the resulting binary directory to your path.
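
After installing, it can help to confirm which of the tools are actually reachable from your shell. The sketch below is one way to do that with the standard command -v lookup; report_tool is a hypothetical helper name, and the tool names are the ones used in this README (your installed version may name them differently).

```shell
#!/bin/sh
# Print where a tool resolves on PATH, or report that it is missing.
report_tool() {
  if path=$(command -v "$1" 2>/dev/null); then
    printf '%s: %s\n' "$1" "$path"
  else
    printf '%s: not found on PATH\n' "$1"
    return 1
  fi
}

# CLI and GUI binaries mentioned in this README.
for tool in nsys nv-nsight-cu-cli nsight-sys nv-nsight-cu; do
  report_tool "$tool" || true
done
```

If a tool installed from a .run file is reported as missing, the directory containing its binaries probably still needs to be added to your path.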

Preparing for Profiling

Source code annotations

#include <nvToolsExt.h>

nvtxRangePush("span 1");
nvtxRangePush("a nested span");
nvtxRangePop(); // end nested span
nvtxRangePop(); // end span 1

Also link with -lnvToolsExt.

nvcc

Compile with optimizations turned on, and without debug information. The nvcc flags most likely to be relevant are below:

--profile                                       (-pg)                           
        Instrument generated code/executable for use by gprof (Linux only).

--debug                                         (-g)                            
        Generate debug information for host code.

--device-debug                                  (-G)                            
        Generate debug information for device code. Turns off all optimizations.
        Don't use for profiling; use -lineinfo instead.

--generate-line-info                            (-lineinfo)                     
        Generate line-number information for device code.

So, change nvcc -g/-pg/-G ... to nvcc <your optimization flags> -lineinfo ....
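
If the flags live in a build script, the same substitution can be sketched as a small filter. This is only an illustration: profiling_flags is a hypothetical helper, -arch=sm_70 is just an example of a flag that should pass through untouched, and flags containing spaces are not handled.

```shell
#!/bin/sh
# Drop -g/-pg/-G (and their long forms) and make sure -lineinfo is present.
profiling_flags() {
  out=""
  have_lineinfo=0
  for flag in "$@"; do
    case $flag in
      -g|--debug|-pg|--profile|-G|--device-debug)
        ;;  # not wanted when profiling
      -lineinfo|--generate-line-info)
        have_lineinfo=1
        out="$out $flag"
        ;;
      *)
        out="$out $flag"
        ;;
    esac
  done
  [ "$have_lineinfo" -eq 1 ] || out="$out -lineinfo"
  printf '%s\n' "${out# }"
}

profiling_flags -g -G -O3 -arch=sm_70   # prints: -O3 -arch=sm_70 -lineinfo
```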

cuda-memcheck

If your code writes to unallocated memory, it may corrupt the profiling process. If profiling fails, try running your code under cuda-memcheck, which instruments your binary to detect invalid GPU memory accesses. Fix any errors it reports, then try profiling again. Expect a slowdown of roughly 100x, so try a small dataset first.

cuda-memcheck my-binary

Nsight Systems Environment Check

Run nsys status -e. You should see something like

Sampling Environment Check
Linux Kernel Paranoid Level = 2: OK
Linux Distribution = Ubuntu
Linux Kernel Version = 4.16.15-41615: OK
Linux perf_event_open syscall available: OK
Sampling trigger event available: OK
Intel(c) Last Branch Record support: Available
Sampling Environment: OK

Errors may reduce the amount of information collected, or cause profiling to fail. Consult documentation for troubleshooting steps.
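
In a script, the check can be reduced to looking for the summary line. The sketch below assumes the output format shown above, in particular that a healthy system ends with the line "Sampling Environment: OK"; other nsys versions may print something different. env_check_ok is a hypothetical helper name.

```shell
#!/bin/sh
# Succeeds if the text on stdin contains the "Sampling Environment: OK" summary line.
env_check_ok() {
  grep -q '^Sampling Environment: OK'
}

# Demonstration on text copied from the sample output above.
# On a real system, use: nsys status -e | env_check_ok
sample='Sampling Environment Check
Linux Kernel Paranoid Level = 2: OK
Sampling Environment: OK'

if printf '%s\n' "$sample" | env_check_ok; then
  echo "sampling environment looks OK"
else
  echo "sampling environment check did not report OK"
fi
```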

Capturing a Profile with CLI

Under this scheme, we

use the CLI on the target to record a profiling file
transfer that file to the client
use the GUI on the client to analyze the record

Nsight Compute

This command will

Generate a.nsight-cuprof-report with recorded profiling information
Measure metrics associated with all sections
Profile the 6th invocation of __global__ void kernel_name(...)
Run a.out

nv-nsight-cu-cli \
  -o a \
  --sections ".*" \
  --kernel-id ::kernel_name:6 \
  a.out

To see sections that will be recorded for a command, add --list-sections.

nv-nsight-cu-cli --list-sections
----------------------------- ------------------------------- --------------------------------------------------
Identifier                    Display Name                    Filename                                          
----------------------------- ------------------------------- --------------------------------------------------
ComputeWorkloadAnalysis       Compute Workload Analysis       .../../../sections/ComputeWorkloadAnalysis.section
InstructionStats              Instruction Statistics          ...64/../../sections/InstructionStatistics.section
LaunchStats                   Launch Statistics               ...1_3-x64/../../sections/LaunchStatistics.section
MemoryWorkloadAnalysis        Memory Workload Analysis        ...4/../../sections/MemoryWorkloadAnalysis.section
MemoryWorkloadAnalysis_Chart  Memory Workload Analysis Chart  ..../sections/MemoryWorkloadAnalysis_Chart.section
MemoryWorkloadAnalysis_Tables Memory Workload Analysis Tables .../sections/MemoryWorkloadAnalysis_Tables.section
Occupancy                     Occupancy                       ...ibc_2_11_3-x64/../../sections/Occupancy.section
SchedulerStats                Scheduler Statistics            ...-x64/../../sections/SchedulerStatistics.section
SourceCounters                Source Counters                 ..._11_3-x64/../../sections/SourceCounters.section
SpeedOfLight                  GPU Speed Of Light              ..._2_11_3-x64/../../sections/SpeedOfLight.section
WarpStateStats                Warp State Statistics           ...-x64/../../sections/WarpStateStatistics.section

To see the metrics supported on a device, run nv-nsight-cu-cli --devices 0 --query-metrics.

On some newer devices, the base metric names will not work, and you need to append an allowed suffix. To see all the legal names and suffixes, run nv-nsight-cu-cli --devices 0 --query-metrics --query-metrics-mode all.

The --kernel-id flag takes a string like context-id:stream-id:[name-operator:]kernel-name:invocation-nr. Commonly, we might only use kernel-name, to select kernels to profile by name, and invocation-nr, to select which invocation of the kernels to profile.
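
For the common case where only the kernel name and invocation number are given, the string can be assembled as below. This is a sketch: kernel_id is a hypothetical helper, and it simply leaves the context-id and stream-id fields empty, as in the ::kernel_name:6 example earlier in this section. The final line only prints the command rather than running it.

```shell
#!/bin/sh
# Build "::<kernel-name>:<invocation-nr>" with empty context-id and stream-id.
kernel_id() {
  [ "$#" -eq 2 ] || { echo "usage: kernel_id KERNEL_NAME INVOCATION_NR" >&2; return 2; }
  printf '::%s:%s\n' "$1" "$2"
}

# Print (not run) the resulting profiling command.
echo nv-nsight-cu-cli -o a --kernel-id "$(kernel_id kernel_name 6)" a.out
```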

Nsight Systems

This command will

Record profiling info to a.qdrep
Run a.out

nsys profile \
  -o a \
  a.out
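
The record and transfer steps of this scheme could be driven from the client roughly as follows. Everything here is an assumption for illustration: gpu-node is a hypothetical SSH host name for the target, fetch_profile and run are made-up helpers, and the script defaults to a dry run that only prints the commands it would execute.

```shell
#!/bin/sh
TARGET=${TARGET:-gpu-node}   # hypothetical SSH host name of the target
DRY_RUN=${DRY_RUN:-1}        # 1: print commands only, anything else: execute them

# Either print a command or execute it, depending on DRY_RUN.
run() {
  if [ "$DRY_RUN" = 1 ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# $1: name passed to -o, $2: binary to profile on the target
fetch_profile() {
  run ssh "$TARGET" "nsys profile -o $1 $2"   # record on the target
  run scp "$TARGET:$1.qdrep" .                # copy the report to the client
}

fetch_profile a ./a.out
```

With DRY_RUN=0 and a reachable target, the copied a.qdrep can then be opened in the Nsight Systems GUI on the client.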

Using the GUI on a client to view a recorded file from the target

In Nsight Compute:

File > Open File ... > file.nsight-cuprof-report

If you profiled on a different system from the one the GUI tool is running on, and you want to look at a View that includes the source, you may have to click the "resolve" button to navigate to a local copy of the source file.

In Nsight Systems:

File > Open > file.qdrep

Using the GUI on the client to Control Remote Profiling on the target

instructions to come

Managing docker images

docker ps -a
docker rm $(docker ps -a -q)
docker system prune

Run a profiling container:

docker run cwpearson/nvidia-performance-tools:latest-amd64

Resume a previously exited container:

* docker ps -a       # find the ID
* docker start <ID>  # resume the exited container
* docker attach <ID> # attach a terminal to the container

For Contributors

See CONTRIBUTING.md

Resources

Nvidia Nsight Systems Docs
Nvidia Nsight Compute Docs
NVIDIA Developer Blog
Interpreting Nsight Compute Results
- Workload Memory Analysis
  - CUDA Memory Model
  - Device Memory Access Performance Guidelines
- Stall Reasons
  - Nsight Graphics Docs: Stall Reasons
- Issue Efficiency
  - Issue Efficiency Nsight Visual Studio Edition
- Occupancy
  - Nsight Visual Studio Edition
Slides
- docs/GEMM-joint-tiling.ppt: Joint-tiling slide deck from ECE 508 Spring 2017
GTC
- Volta Architecture and Performance Optimization: Volta L1 will cache writes
