Using Nsight Compute and Nsight Systems

Based on slides by Carl Pearson of the University of Illinois

Introduction to Profiling

System- and Kernel-Level Profiling

Nsight Compute

Kernel-Level Profiling

  • How fast does the GPU execute my kernel?

Nsight Systems

System-Level Profiling

  • How efficiently is my system delivering work to the GPU?
  • What is my system doing while the GPU is working?
  • How fast is data moving to/from the GPU?
  • How much time does the CPU take to control the GPU?
  • When do asynchronous operations occur?

Common GPU Development Model

							digraph {
								layout="neato"
								mode="sgd"
								beautify=true
								bgcolor="transparent"
								fontname="Noto Sans"
								node [fontname="Noto Sans"]
								rankdir=LR
								node [shape="box" style="rounded" margin=0.2]
								edge [len=2]

								Client [label="You on your laptop" pos="0,1"]
								Job [label="Specify job on Mogon" pos="2,2"]
								Mogon [label="Server with GPUs" pos="4,1"]
								Result [label="Download from Mogon" pos="2,0"]

								Client -> Job;
								Job -> Mogon;
								Mogon -> Result;
								Result -> Client;
							}
						

Two-Phase Profiling

							digraph {
								bgcolor="transparent"
								fontname="Noto Sans"
								node [fontname="Noto Sans"]
								edge [fontname="Noto Sans"]
								rankdir=LR
								node [shape="box" style="rounded" margin=0.25]

								target[label="Record Profiling Data on target\n\nnsys profile ...\nncu ..."]
								client[label="Analyse profiling data on client\n\nnsys-ui\nncu-ui"]

								target -> client [label="Copy profiling data to client\n\nssh, scp, smb, ...", len=5]

							}
						

Preparing for Profiling

Host Code Annotations

You can use the NVIDIA Tools Extensions to annotate your code for profiling

#include <nvtx3/nvtx3.hpp>

Example

						
						#include <nvtx3/nvtx3.hpp>
						#include <chrono>
						#include <thread>

						void some_function() {
							NVTX3_FUNC_RANGE();  // Range around the whole function
							for (int i = 0; i < 6; ++i) {
								nvtx3::scoped_range loop{"loop range"};  // Range around one iteration

								// Make each iteration last for one second
								std::this_thread::sleep_for(std::chrono::seconds{1});
							}
						}
						
					

Correctness

  • Subtle errors that do not cause your kernel to terminate under normal conditions can still cause errors while profiling
    • esp. writing outside of allocated memory
  • Run your code with compute-sanitizer if profiling crashes or misbehaves
    • Automatically instruments the binary to detect, for example, bad memory behaviour
    • Causes roughly a 100× slowdown, so try small datasets first
    • Fix any errors that come up, then profile again

						compute-sanitizer --tool memcheck  ./my-cuda-binary
						compute-sanitizer --tool racecheck ./my-cuda-binary
						compute-sanitizer --tool initcheck ./my-cuda-binary
						compute-sanitizer --tool synccheck ./my-cuda-binary
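
As a concrete illustration (a hypothetical kernel, not from the original slides), memcheck flags exactly the kind of subtle out-of-bounds write described above: the kernel below may appear to run fine, yet the last block writes past the end of the allocation.

```cuda
// Hypothetical kernel with a subtle bug: no bounds check, so the last
// block writes past the end of the allocation when n is not a multiple
// of the block size. The program usually "works" anyway, but
// `compute-sanitizer --tool memcheck` reports the invalid writes.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = 2.0f * data[i];  // BUG: should be guarded by `if (i < n)`
}

int main() {
    const int n = 1000;  // not a multiple of 256
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    scale<<<(n + 255) / 256, 256>>>(d, n);  // 1024 threads touch 1000 floats
    cudaDeviceSynchronize();
    cudaFree(d);
}
```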
					

Compiling

  • Compile device code with optimizations
    • non-optimized or debug code often has many more memory references
    • nvcc by default applies many optimizations to device code
    • remove any -G, -pg, -g flags
  • Compile device code with line number annotations
    • add -lineinfo flag to all nvcc calls
    • puts some info in the binary about what source file locations generated what machine code

$ nvcc -G main.cu         # debug build: do not profile this
$ nvcc -lineinfo main.cu  # optimized build with line info: profile this

Caveats

Profiling affects the performance of your kernel!

It will help you improve the speed, but do not report the time during profiling as the performance of your code. Always run and time without profiling.
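
For the final numbers, a common approach is to time the kernel directly with CUDA events, with no profiler attached. A minimal sketch (the kernel and its launch configuration are placeholders):

```cuda
#include <cstdio>

// Placeholder kernel; stands in for the kernel under test.
__global__ void my_kernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    my_kernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // GPU time in milliseconds
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
}
```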

Kernel Profiling with Nsight Compute

NVIDIA Nsight Compute

  • Record and analyse detailed kernel performance metrics
  • Two interfaces:
    • GUI: ncu-ui
    • CLI: ncu
  • Directly consuming ~1000 metrics is challenging; the GUI helps make sense of them
  • On supercomputer clusters, use a two-part record-then-analyse flow
							digraph {
								bgcolor="transparent"
								fontname="Noto Sans"
								node [fontname="Noto Sans"]
								edge [fontname="Noto Sans"]
								rankdir=LR
								node [shape="box" style="rounded" margin=0.35]

								cli[label="Record data on target platform", xlabel="ncu"]
								gui[label="Analyse data on client", xlabel="ncu-ui"]

								cli -> gui [label="download", len=10]

							}
						

Kernel Profiling

  • Device has many performance counters to record detailed information
    • Made available as "metrics".
    • To get a list: ncu --devices 0 --query-metrics

Record kernel traces


					$ ncu                             \
					--kernel-id ::my_kernel:6         \
					--section ".*"                    \
					-o my_kernel_%h_$(date "+%F")_%i  \
					./my_kernel                        
					
  • --kernel-id ::my_kernel:6: Profile the 6th time the "my_kernel" kernel runs (i.e. 5 warmups)
  • --section ".*": Record metrics for all report sections
  • -o my_kernel_%h_$(date "+%F")_%i: Creates a report file named after the host, the date and an ID
  • ./my_kernel: Name of the executable to profile

Nsight Compute Sections

Sections are a group of related measurements.

The default list can be generated by $ ncu --list-sections

--sections ".*" provides a regex which selects all sections instead.

  • Tabs for each section
  • Button to add the current result as a baseline
  • After adding a baseline, the other tabs compare their measurements with it.

Speed of Light

The speed of light (SOL) is a measure of how much of the GPU's peak capability was used. A kernel at 100% SOL would use all of the memory bandwidth and all of the compute units, with no waiting.

Workload Memory Analysis

Memory Chart

Global Memory
shared by all threads
Local Memory
private per thread
Shared Memory
shared by threads in a block
Texture/Surface
cached for 2D spatial locality
Constant
cached in the constant cache

Workload Memory Analysis: Charts

  • Detailed information summarized
  • TEX means the first-level cache

Scheduler Statistics

Theoretical Warps
Pool of warps the scheduler could pick from. Limited by the device.
Active Warps
Number of warps actually resident on the SM. Lower than theoretical when there is not enough work, or the work is imbalanced.
Eligible Warps
Number of warps ready to execute. Warps are ineligible while waiting for a barrier, waiting for an instruction fetch, waiting for data…
Issued Warps
Number of warps that issued an instruction: usually a maximum of 1 or 2 per cycle, depending on hardware.

A good average value does not mean that no warp-scheduling opportunities are missed.

Warp State Statistics

Warp cycles per issued instruction
average latency between two consecutive instructions.

More latency → More warp parallelism needed to hide.

Warp State
average number of cycles spent in that state for each instruction

Stalls cannot always be avoided and only really matter if instructions can’t be issued every cycle

Instruction Hotspots

Switch to the "Source" page.

Show various metrics correlated with source code lines and PTX instructions

Some source code lines generate very many PTX instructions: sometimes it helps to split one source line into several lines to get more detail.

If profiling on a different system, source file may not automatically load since paths may not match.

Click "resolve" and find your local copy of the code that was compiled or run remotely

Instruction Sampling

  • Every so often, the position of the program counter is recorded
  • Slower instructions are more likely to be recorded
  • There will be many samples in slow parts of the code, and few in fast parts of the code

Hotspots

The program counter spends most of its time on instructions from this line. Mouse over for breakdown.

When optimizing code, focus on the parts that are runtime intensive and/or use a lot of registers.

System Profiling with Nsight Systems

NVIDIA Nsight Systems

  • Deliver work to the GPU effectively
    • Understand the performance of the surrounding system
  • Two interfaces:
    • GUI: nsys-ui
    • CLI: nsys
  • Again, on supercomputer clusters, use a two-part record-then-analyse flow
							digraph {
								bgcolor="transparent"
								fontname="Noto Sans"
								node [fontname="Noto Sans"]
								edge [fontname="Noto Sans"]
								rankdir=LR
								node [shape="box" style="rounded" margin=0.35]

								cli[label="Record data on target platform", xlabel="nsys"]
								gui[label="Analyse data on client", xlabel="nsys-ui"]

								cli -> gui [label="download", len=10]

							}
						

Record system traces


					$ nsys profile                    \
					-o my_kernel_%h_$(date "+%F")_%i  \
					./my_kernel                        
					
  • nsys profile: Tells nsys to profile the executable
  • -o my_kernel_%h_$(date "+%F")_%i: Creates a report file named after the host, the date and an ID
  • ./my_kernel: Name of the executable to profile

Kernel Time vs. Wall Time

  • CPU activity
  • GPU activity

Overlap

No overlap of transfer and kernel (3.5 ms)

Overlap of transfer and kernel! (2.5 ms)
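
The overlapped version can be achieved with pinned host memory, asynchronous copies, and multiple streams. A minimal sketch (kernel, buffer names, and sizes are illustrative):

```cuda
#include <cstddef>

__global__ void my_kernel(float *x, int n) {  // placeholder kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20, half = n / 2;
    const size_t bytes = half * sizeof(float);

    // Pinned host memory is required for copies to actually run asynchronously.
    float *h, *d;
    cudaMallocHost(&h, n * sizeof(float));
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // First half in stream 0, second half in stream 1: the copy in one
    // stream can overlap the kernel in the other.
    cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, s0);
    my_kernel<<<(half + 255) / 256, 256, 0, s0>>>(d, half);
    cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, s0);

    cudaMemcpyAsync(d + half, h + half, bytes, cudaMemcpyHostToDevice, s1);
    my_kernel<<<(half + 255) / 256, 256, 0, s1>>>(d + half, half);
    cudaMemcpyAsync(h + half, d + half, bytes, cudaMemcpyDeviceToHost, s1);

    cudaDeviceSynchronize();  // wait for both streams to drain

    cudaStreamDestroy(s0); cudaStreamDestroy(s1);
    cudaFree(d); cudaFreeHost(h);
}
```

In the Nsight Systems timeline, the two streams' transfers and kernels should then appear side by side instead of strictly serialized.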

Not Discussed

  • Measuring across multiple streams with CUDA events
  • Profiling through the Nsight Compute GUI
  • Profiling through the Nsight Systems GUI
  • In-kernel timing with clock()/clock64()
  • Custom profiling hooks with CUDA Performance Tools Interface (CUPTI)