How can I debug code 700 "illegal memory access" aka `CUDA_EXCEPTION_14, Warp Illegal Address`?

2023-12-09

An answer to this question on Stack Overflow.

Question

My code is showing

CUDA error calling "cudaStreamSynchronize(0)", code is 700 an illegal memory access was encountered on 489
 0# my_func(signed char const*, unsigned char const*, int*, int*, int, int) in libthing.so

How can I debug this?

Answer

Debugging illegal memory access / Warp Illegal Address

Here's a comprehensive guide to discovering what will probably be a stupid mistake. The first two steps are somewhat superfluous because we'll ultimately use a debugger, but they're very good ideas for helping isolate a variety of problems.

Step 1. Make sure you're checking all CUDA API calls for errors on the host, including kernel launches.

My output looked like this

CUDA error calling "cudaStreamSynchronize(0)", code is 700 an illegal memory access was encountered on 489

where Line 489 is the first time I checked a CUDA API call for errors. This has almost nothing to do with where the error happened.

Since we can also see CUDA errors from the GPU if we check with them, we need to make sure our code checks for them any time we interact with the GPU.

To do so, I use the following header file to define error-checking functions:

#pragma once
#include <boost/stacktrace.hpp>
#include <iostream>
#include <stdexcept>
#define STRINGIZE_DETAIL(x) #x
#define STRINGIZE(x) STRINGIZE_DETAIL(x)
#define CUDA_CHECK(call)                                                                                               \
  do {                                                                                                                 \
    if ((call) != cudaSuccess) {                                                                                       \
      const cudaError_t err = cudaGetLastError();                                                                      \
      std::cerr << "CUDA error calling \"" #call "\", code is " << err << " " << cudaGetErrorString(err) << " on "     \
                << __LINE__ << "\n"                                                                                    \
                << boost::stacktrace::stacktrace() << std::endl;                                                       \
      throw std::runtime_error("Problem.");                                                                            \
    }                                                                                                                  \
  } while (0)
#define CUDA_KERNEL_LAUNCH_CHECK()                                                                                     \
  do {                                                                                                                 \
    const auto cuda_err = cudaGetLastError();                                                                          \
    if (cuda_err != cudaSuccess) {                                                                                     \
      throw std::runtime_error(std::string("CUDA kernel launch failed! ") + cudaGetErrorString(cuda_err));             \
    }                                                                                                                  \
  } while (0)

If your code looked like this beforehand:

cudaMemcpy(dst, src, 20*sizeof(float), cudaMemcpyHostToDevice);
my_kernel<<<blocks, threads, shared_mem, stream>>>(dst, result);

after adding error checking it will look like

CUDA_CHECK(cudaMemcpy(dst, src, 20*sizeof(float), cudaMemcpyHostToDevice));
my_kernel<<<blocks, threads, shared_mem, stream>>>(dst, result);
CUDA_KERNEL_LAUNCH_CHECK();

Note that you want to have a CUDA_KERNEL_LAUNCH_CHECK() immediately following every kernel launch.

Now that your code is checking for errors you'll see problems closer to where they happened.

2. Use CUDA_LAUNCH_BLOCKING

Run your program with

CUDA_LAUNCH_BLOCKING=1 ./my_program.exe

CUDA_LAUNCH_BLOCKING will cause each kernel to finish running before moving on to the next line. Since every kernel launch is followed by CUDA_KERNEL_LAUNCH_CHECK() now this will tell you exactly which kernel caused the problem.

3. Compile in debug mode

Compile your code like

nvcc -g G my.cu source.cu files.cu

This ensures that you'll have access to the CUDA source code inside the debugger and be able to step through it.

4. Use cuda-gdb

Run your code like so.

cuda-gdb --silent --ex run --args ./my_program.exe

whereas running the program without the debugger gave an address in the C++ host source:

CUDA error calling "cudaStreamSynchronize(0)", code is 700 an illegal memory access was encountered on 489

running the program in cuda-gdb gives:

CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x10000320778 (cuda.cu:414)
Thread 1 "benchmark.exe" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 15, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]
0x000001000031f450 in my_kernel<<<(1,1,1),(128,1,1)>>> (rec=0x7fffc7c00000 '\377' <repeats 200 times>, arg1=0x7fffce800000, arg2=0x7fffc7de8a00, 
    arg3=0x555555a7a9a0, arg4=0x7fffc7cf4400 "", arg5=1000, arg6=1000) at my_source_file.cu:382
382	      const auto ci = order[i];

The debugger has identified that Line 414 is the problem - and it is!

That lines looks like this

array[i] = val;

we suspect that the assignment is causing a problem because that involves an address. But why?

Notes that all of the arguments have pointers beginning with 0x7fffc except for argument 5 which is 0x555555a7a9a0. This isn't hexadecimal garbage - it's actually the source of the problem. CUDA hasn't allocated a single block of memory at 0x555555a7a9a0 far distant from all the other addresses. Instead, a host pointer has been passed to the device and that is causing the failure. This is why the values of the pointers are so different.

Note that although the debugger appears to have stopped at

382	      const auto ci = order[i];

this line has nothing to do with the problem. I'm not sure why the compiler stopped here.