Based on slides by Carl Pearson of the University of Illinois
Kernel-Level Profiling
System-Level Profiling
digraph { layout="neato" mode="sgd" beautify=true bgcolor="transparent" fontname="Noto Sans" node [fontname="Noto Sans"] rankdir=LR node [shape="box" style="rounded" margin=0.2] edge [len=2] Client [label="You on your laptop" pos="0,1"] Job [label="Specify job on Mogon" pos="2,2"] Mogon [label="Server with GPUs" pos="4,1"] Result [label="Download from Mogon" pos="2,0"] Client -> Job; Job -> Mogon; Mogon -> Result; Result -> Client; }
digraph { bgcolor="transparent" fontname="Noto Sans" node [fontname="Noto Sans"] edge [fontname="Noto Sans"] rankdir=LR node [shape="box" style="rounded" margin=0.25] target[label="Record Profiling Data on target\n\nnsys profile ...\nncu ..."] client[label="Analyse profiling data on client\n\nnsys-ui\nncu-ui"] target -> client [label="Copy profiling data to client\n\nssh, scp, smb, ...", len=5] }
You can use the NVIDIA Tools Extension (NVTX) to annotate your code for profiling.
#include <nvtx3/nvtx3.hpp>
Example
#include <chrono>
#include <thread>
#include <nvtx3/nvtx3.hpp>

void some_function() {
    NVTX3_FUNC_RANGE(); // Range around the whole function
    for (int i = 0; i < 6; ++i) {
        nvtx3::scoped_range loop{"loop range"}; // Range for one iteration
        // Make each iteration last for one second
        std::this_thread::sleep_for(std::chrono::seconds{1});
    }
}
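To see these ranges on a timeline, NVTX tracing has to be part of the recording. A minimal sketch, assuming `nsys` is on the PATH and the example above was built into a binary called `./nvtx_example`. The function name, binary name, and report name `nvtx_ranges` are placeholders, not names from the slides.

```shell
# Sketch: record CUDA API calls and NVTX ranges with Nsight Systems.
# --trace selects which APIs nsys records; "nvtx" picks up the ranges
# created by NVTX3_FUNC_RANGE() and nvtx3::scoped_range.
profile_with_nvtx() {
  nsys profile --trace=cuda,nvtx -o nvtx_ranges "$@"
}

# Usage (placeholder binary name):
#   profile_with_nvtx ./nvtx_example
```

The resulting report can then be opened in nsys-ui, where the named ranges appear alongside the CUDA activity.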
compute-sanitizer
Use compute-sanitizer if profiling crashes or misbehaves:
compute-sanitizer --tool memcheck ./my-cuda-binary
compute-sanitizer --tool racecheck ./my-cuda-binary
compute-sanitizer --tool initcheck ./my-cuda-binary
compute-sanitizer --tool synccheck ./my-cuda-binary
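The four checks can also be run in one go. A sketch of a hypothetical helper, `run_all_checks`, assuming `compute-sanitizer` is on the PATH:

```shell
# Sketch: run each compute-sanitizer tool on the same binary, one after
# the other, printing a header line before each run.
run_all_checks() {
  binary="$1"   # path to the CUDA executable to check
  for tool in memcheck racecheck initcheck synccheck; do
    echo "== ${tool} =="
    compute-sanitizer --tool "${tool}" "${binary}"
  done
}

# Usage:
#   run_all_checks ./my-cuda-binary
```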
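Avoid the -G, -pg, and -g flags when building for profiling (-G disables device code optimization).
Add the -lineinfo flag to all nvcc calls, i.e. replace
$ nvcc -G main.cu
with
$ nvcc -lineinfo main.cu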
Profiling affects the performance of your kernel!
Profiling helps you improve the speed of your code, but do not report times measured under the profiler as the performance of your code. Always run and time without profiling.
ncu-ui
ncu
digraph { bgcolor="transparent" fontname="Noto Sans" node [fontname="Noto Sans"] edge [fontname="Noto Sans"] rankdir=LR node [shape="box" style="rounded" margin=0.35] cli[label="Record data on target platform", xlabel="ncu"] gui[label="Analyse data on client", xlabel="ncu-ui"] cli -> gui [label="download", len=10] }
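The download step can be sketched as a small shell function around `scp`. The host `user@mogon` and the report path in the usage line are placeholders, not values from the slides; substitute your own login and file.

```shell
# Sketch: copy a recorded report from the target machine into the
# current directory on the client. Host and path are placeholders.
fetch_report() {
  remote_host="$1"   # e.g. user@mogon (hypothetical)
  remote_path="$2"   # report file written by ncu on the target
  scp "${remote_host}:${remote_path}" .
}

# Usage (hypothetical names):
#   fetch_report user@mogon profiles/my_kernel.ncu-rep
```

The downloaded file can then be opened in ncu-ui on the client.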
ncu --devices 0 --query-metrics
$ ncu \
--kernel-id ::my_kernel:6 \
--section ".*" \
-o my_kernel_%h_$(date "+%F")_%i \
./my_kernel
--kernel-id ::my_kernel:6 : Profile the 6th time the "my_kernel" kernel runs (i.e. 5 warmups)
--section ".*" : Record metrics for all report sections
-o my_kernel_%h_$(date "+%F")_%i : Creates a report file named after the host, the date and an ID
./my_kernel : Name of the executable to profile
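It helps to know which parts of the `-o` argument the shell expands and which are left for `ncu`. The sketch below only prints the string that `ncu` would receive: the shell replaces `$(date "+%F")` with the current date, while `%h` and `%i` are passed through unchanged for `ncu` to fill in. The variable name `report_name` is made up for illustration.

```shell
# The shell runs $(date "+%F") before ncu starts; %h and %i mean nothing
# to the shell and reach ncu as literal text.
report_name="my_kernel_%h_$(date "+%F")_%i"
echo "${report_name}"
```

The printed string has the form `my_kernel_%h_YYYY-MM-DD_%i`.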
A section is a group of related measurements.
The default list can be generated by $ ncu --list-sections
--section ".*" provides a regex which selects all sections instead.
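When only a few sections are of interest, they can be named explicitly instead of matching everything, which usually shortens the profiling run. A sketch, assuming the identifiers `SpeedOfLight` and `Occupancy` appear in the output of `ncu --list-sections` on your installation. The function name and the report name are made up for illustration.

```shell
# Sketch: collect only two named sections for whatever command is passed in.
profile_selected_sections() {
  ncu --section SpeedOfLight --section Occupancy -o selected_sections_%i "$@"
}

# Usage:
#   profile_selected_sections ./my_kernel
```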
The speed of light (SOL) is a measure of how much of the GPU's capabilities have been used. An algorithm at 100% SOL would have used all of the memory bandwidth and all of the threads, with no waiting, and so on.
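As a rough worked example, a single SOL-style figure is the achieved value expressed as a percentage of the theoretical peak. The helper name and the numbers below are invented for illustration and do not describe any particular GPU.

```shell
# percent_of_peak ACHIEVED PEAK -> integer percentage, rounded down
percent_of_peak() {
  echo $(( 100 * $1 / $2 ))
}

# Achieved 450 GB/s out of a 900 GB/s peak (made-up numbers)
percent_of_peak 450 900   # prints 50
```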
A good average value does not mean that no warp scheduling opportunities are missed.
More latency → more warp parallelism needed to hide it.
Stalls cannot always be avoided and only really matter if instructions can’t be issued every cycle
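The relation between latency and required parallelism can be estimated with Little's law: work in flight = latency × issue rate. This is a simplified model brought in for illustration, not something the profiler reports, and the cycle counts below are made up.

```shell
# in_flight_needed LATENCY_CYCLES ISSUES_PER_CYCLE
# -> independent instructions that must be in flight to keep issuing
in_flight_needed() {
  echo $(( $1 * $2 ))
}

in_flight_needed 20 1   # prints 20
in_flight_needed 40 1   # prints 40: doubling the latency doubles the need
```

If each warp contributes one independent instruction at a time, these numbers are also the number of warps the scheduler needs to choose from.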
Switch to the "Source" page.
Show various metrics correlated with source code lines and PTX instructions
Some source code lines generate a large number of PTX instructions; splitting such a line into several lines can give more detail.
If you profile on a different system, the source file may not load automatically because the paths may not match.
Click "resolve" and find your local copy of the code that was compiled or run remotely
The program counter spends most of its time on instructions from this line. Mouse over for breakdown.
When optimizing code, focus on the parts that are runtime intensive and/or use a lot of registers.
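Register usage can also be checked at compile time, before any profiling. A sketch, assuming `nvcc` is on the PATH: `-Xptxas -v` passes the verbose flag to ptxas, which then prints the number of registers used by each kernel. The function name is made up for illustration.

```shell
# Sketch: compile with line info and ask ptxas for per-kernel resource usage.
show_register_usage() {
  nvcc -lineinfo -Xptxas -v "$@"
}

# Usage:
#   show_register_usage main.cu
```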
nsys-ui
nsys
digraph { bgcolor="transparent" fontname="Noto Sans" node [fontname="Noto Sans"] edge [fontname="Noto Sans"] rankdir=LR node [shape="box" style="rounded" margin=0.35] cli[label="Record data on target platform", xlabel="nsys"] gui[label="Analyse data on client", xlabel="nsys-ui"] cli -> gui [label="download", len=10] }
$ nsys profile \
-o my_kernel_%h_$(date "+%F")_%i \
./my_kernel
nsys profile : Tells nsys to profile the executable.
-o my_kernel_%h_$(date "+%F")_%i : Creates a report file named after the host, the date and an ID
./my_kernel : Name of the executable to profile
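For a quick look without the GUI, a recorded report can be summarised on the command line. A sketch, assuming an `nsys` version that provides the `stats` subcommand. The function name is made up, and the report file name in the usage line is a placeholder, since the extension depends on the nsys version.

```shell
# Sketch: print summary tables for an existing Nsight Systems report.
summarise_report() {
  nsys stats "$1"
}

# Usage (placeholder file name):
#   summarise_report my_report.nsys-rep
```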