Using Nsight Compute and Nsight Systems

Based on slides by Carl Pearson of the University of Illinois

Introduction to Profiling

System- and Kernel-Level Profiling

Nsight Compute

Kernel-Level Profiling

  • How fast does the GPU execute my kernel?

Nsight Systems

System-Level Profiling

  • How efficiently is my system delivering work to the GPU?
  • What is my system doing while the GPU is working?
  • How fast is data moving to/from the GPU?
  • How much time does the CPU take to control the GPU?
  • When do asynchronous operations occur?

Common GPU Development Model

							digraph {
								layout="neato"
								mode="sgd"
								beautify=true
								bgcolor="transparent"
								fontname="Noto Sans"
								node [fontname="Noto Sans"]
								rankdir=LR
								node [shape="box" style="rounded" margin=0.2]
								edge [len=2]

								Client [label="You on your laptop" pos="0,1"]
								Job [label="Specify job on Mogon" pos="2,2"]
								Mogon [label="Server with GPUs" pos="4,1"]
								Result [label="Download from Mogon" pos="2,0"]

								Client -> Job;
								Job -> Mogon;
								Mogon -> Result;
								Result -> Client;
							}
						

Two-Phase Profiling

							digraph {
								bgcolor="transparent"
								fontname="Noto Sans"
								node [fontname="Noto Sans"]
								edge [fontname="Noto Sans"]
								rankdir=LR
								node [shape="box" style="rounded" margin=0.25]

								target[label="Record Profiling Data on target\n\nnsys profile ...\nncu ..."]
								client[label="Analyse profiling data on client\n\nnsys-ui\nncu-ui"]

								target -> client [label="Copy profiling data to client\n\nssh, scp, smb, ...", len=5]

							}
						

Preparing for Profiling

Host Code Annotations

You can use the NVIDIA Tools Extensions to annotate your code for profiling

#include <nvtx3/nvtx3.hpp>

Example

						
						#include <nvtx3/nvtx3.hpp>
						#include <chrono>
						#include <thread>

						void some_function() {
							NVTX3_FUNC_RANGE();  // Range around the whole function
							for (int i = 0; i < 6; ++i) {
								nvtx3::scoped_range loop{"loop range"};  // Range around one iteration

								// Make each iteration last for one second
								std::this_thread::sleep_for(std::chrono::seconds{1});
							}
						}
						
					

Correctness

  • Subtle errors that do not cause your kernel to terminate under normal conditions can still cause errors while profiling
    • esp. writing outside of allocated memory
  • Run your code with compute-sanitizer if profiling crashes or misbehaves
    • Automatically instruments the binary to detect, for example, bad memory behaviour
    • Causes roughly a 100× slowdown, so try small datasets first
    • Fix any errors that come up, then profile again

						compute-sanitizer --tool memcheck  ./my-cuda-binary
						compute-sanitizer --tool racecheck ./my-cuda-binary
						compute-sanitizer --tool initcheck ./my-cuda-binary
						compute-sanitizer --tool synccheck ./my-cuda-binary
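
As a concrete illustration (a hypothetical kernel, not from the original slides), memcheck flags exactly the kind of subtle out-of-bounds write described above: the kernel below may appear to run fine, yet the last block writes past the end of the allocation.

```cuda
// Hypothetical kernel with a subtle bug: no bounds check, so the last
// block writes past the end of the allocation when n is not a multiple
// of the block size. The program usually "works" anyway, but
// `compute-sanitizer --tool memcheck` reports the invalid writes.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = 2.0f * data[i];  // BUG: should be guarded by `if (i < n)`
}

int main() {
    const int n = 1000;  // not a multiple of 256
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    scale<<<(n + 255) / 256, 256>>>(d, n);  // 1024 threads touch 1000 floats
    cudaDeviceSynchronize();
    cudaFree(d);
}
```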
					

Compiling

  • Compile device code with optimizations
    • non-optimized or debug code often has many more memory references
    • nvcc by default applies many optimizations to device code
    • remove any -G, -pg, -g flags
  • Compile device code with line number annotations
    • add -lineinfo flag to all nvcc calls
    • puts some info in the binary about what source file locations generated what machine code

$ nvcc -G main.cu         # debug build: do not profile this
$ nvcc -lineinfo main.cu  # optimized build with line info: profile this

Caveats

Profiling affects the performance of your kernel!

It will help you improve the speed, but do not report the time during profiling as the performance of your code. Always run and time without profiling.
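
For the final numbers, a common approach is to time the kernel directly with CUDA events, with no profiler attached. A minimal sketch (the kernel and its launch configuration are placeholders):

```cuda
#include <cstdio>

// Placeholder kernel; stands in for the kernel under test.
__global__ void my_kernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    my_kernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // GPU time in milliseconds
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
}
```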

Kernel Profiling with Nsight Compute

NVIDIA Nsight Compute

  • Record and analyse detailed kernel performance metrics
  • Two interfaces:
    • GUI: ncu-ui
    • CLI: ncu
  • Directly consuming ~1000 metrics is challenging; the GUI helps make sense of them
  • On supercomputer clusters, use a two-part record-then-analyse flow
							digraph {
								bgcolor="transparent"
								fontname="Noto Sans"
								node [fontname="Noto Sans"]
								edge [fontname="Noto Sans"]
								rankdir=LR
								node [shape="box" style="rounded" margin=0.35]

								cli[label="Record data on target platform", xlabel="ncu"]
								gui[label="Analyse data on client", xlabel="ncu-ui"]

								cli -> gui [label="download", len=10]

							}
						

Kernel Profiling

  • Device has many performance counters to record detailed information
    • Made available as "metrics".
    • To get a list: ncu --devices 0 --query-metrics

Record kernel traces


					$ ncu                             \
					--kernel-id ::my_kernel:6         \
					--section ".*"                    \
					-o my_kernel_%h_$(date "+%F")_%i  \
					./my_kernel                        
					
  • --kernel-id ::my_kernel:6: Profile the 6th time the "my_kernel" kernel runs (i.e. 5 warmups)
  • --section ".*": Record metrics for all report sections
  • -o my_kernel_%h_$(date "+%F")_%i: Creates a report file named after the host, the date and an ID
  • ./my_kernel: Name of the executable to profile

Nsight Compute Sections

Sections are a group of related measurements.

The default list can be generated by $ ncu --list-sections

--sections ".*" provides a regex which selects all sections instead.

  • Tabs for each section
  • Button to add the current result as a baseline
  • After adding a baseline, the other tabs compare their measurements with it.

Speed of Light

The speed of light (SOL) is a measure of how much of the GPU's peak capability was used. A kernel at 100% SOL would use all of the memory bandwidth and all of the compute units, with no waiting.

Workload Memory Analysis

Memory Chart

Global Memory
shared by all threads
Local Memory
private per thread
Shared Memory
shared by threads in a block
Texture/Surface
cached for 2D spatial locality
Constant
cached in the constant cache

Workload Memory Analysis: Charts

  • Detailed information summarized
  • TEX means the first-level cache

Scheduler Statistics

Theoretical Warps
Pool of warps the scheduler could pick from. Limited by the device.
Active Warps
Number of warps actually resident on the SM. Lower than theoretical when there is not enough work, or the work is imbalanced.
Eligible Warps
Number of warps ready to execute. Warps are ineligible while waiting for a barrier, waiting for an instruction fetch, waiting for data…
Issued Warps
Number of warps that issued an instruction: usually a maximum of 1 or 2 per cycle, depending on hardware.

A good average value does not mean that no warp-scheduling opportunities are missed.

Warp State Statistics

Warp cycles per issued instruction
average latency between two consecutive instructions.

More latency → More warp parallelism needed to hide.

Warp State
average number of cycles spent in that state for each instruction

Stalls cannot always be avoided and only really matter if instructions can’t be issued every cycle

Instruction Hotspots

Switch to the "Source" page.

Show various metrics correlated with source code lines and PTX instructions

Some source code lines generate very many PTX instructions: sometimes it helps to split one source line into several lines to get more detail.

If profiling on a different system, source file may not automatically load since paths may not match.

Click "resolve" and find your local copy of the code that was compiled or run remotely

Instruction Sampling

  • Every so often, the position of the program counter is recorded
  • Slower instructions are more likely to be recorded
  • There will be many samples in slow parts of the code, and few in fast parts of the code

Hotspots

The program counter spends most of its time on instructions from this line. Mouse over for breakdown.

When optimizing code, focus on the parts that are runtime intensive and/or use a lot of registers.

System Profiling with Nsight Systems

NVIDIA Nsight Systems

  • Deliver work to the GPU effectively
    • Understand the performance of the surrounding system
  • Two interfaces:
    • GUI: nsys-ui
    • CLI: nsys
  • Again, on supercomputer clusters, use a two-part record-then-analyse flow
							digraph {
								bgcolor="transparent"
								fontname="Noto Sans"
								node [fontname="Noto Sans"]
								edge [fontname="Noto Sans"]
								rankdir=LR
								node [shape="box" style="rounded" margin=0.35]

								cli[label="Record data on target platform", xlabel="nsys"]
								gui[label="Analyse data on client", xlabel="nsys-ui"]

								cli -> gui [label="download", len=10]

							}
						

Record system traces


					$ nsys profile                    \
					-o my_kernel_%h_$(date "+%F")_%i  \
					./my_kernel                        
					
  • nsys profile: Tells nsys to profile the executable
  • -o my_kernel_%h_$(date "+%F")_%i: Creates a report file named after the host, the date and an ID
  • ./my_kernel: Name of the executable to profile

Kernel Time vs. Wall Time

  • CPU activity
  • GPU activity

Overlap

No overlap of transfer and kernel (3.5 ms)

Overlap of transfer and kernel! (2.5 ms)
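
The overlapped version can be achieved with pinned host memory, asynchronous copies, and multiple streams. A minimal sketch (kernel, buffer names, and sizes are illustrative):

```cuda
#include <cstddef>

__global__ void my_kernel(float *x, int n) {  // placeholder kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20, half = n / 2;
    const size_t bytes = half * sizeof(float);

    // Pinned host memory is required for copies to actually run asynchronously.
    float *h, *d;
    cudaMallocHost(&h, n * sizeof(float));
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // First half in stream 0, second half in stream 1: the copy in one
    // stream can overlap the kernel in the other.
    cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, s0);
    my_kernel<<<(half + 255) / 256, 256, 0, s0>>>(d, half);
    cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, s0);

    cudaMemcpyAsync(d + half, h + half, bytes, cudaMemcpyHostToDevice, s1);
    my_kernel<<<(half + 255) / 256, 256, 0, s1>>>(d + half, half);
    cudaMemcpyAsync(h + half, d + half, bytes, cudaMemcpyDeviceToHost, s1);

    cudaDeviceSynchronize();  // wait for both streams to drain

    cudaStreamDestroy(s0); cudaStreamDestroy(s1);
    cudaFree(d); cudaFreeHost(h);
}
```

In the Nsight Systems timeline, the two streams' transfers and kernels should then appear side by side instead of strictly serialized.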

Not Discussed

  • Measuring across multiple streams with CUDA events
  • Profiling through the Nsight Compute GUI
  • Profiling through the Nsight Systems GUI
  • In-kernel timing with clock()/clock64()
  • Custom profiling hooks with CUDA Performance Tools Interface (CUPTI)