How can I debug code 700 "illegal memory access" aka `CUDA_EXCEPTION_14, Warp Illegal Address`?
An answer to this question on Stack Overflow.
Question
My code is showing
CUDA error calling "cudaStreamSynchronize(0)", code is 700 an illegal memory access was encountered on 489
0# my_func(signed char const*, unsigned char const*, int*, int*, int, int) in libthing.so
How can I debug this?
Answer
Debugging illegal memory access / Warp Illegal Address
Here's a comprehensive guide to discovering what will probably be a stupid mistake. The first two steps are somewhat superfluous because we'll ultimately use a debugger, but they're very good ideas for helping isolate a variety of problems.
Step 1. Make sure you're checking all CUDA API calls for errors on the host, including kernel launches.
My output looked like this
CUDA error calling "cudaStreamSynchronize(0)", code is 700 an illegal memory access was encountered on 489
where Line 489 is the first time I checked a CUDA API call for errors. This has almost nothing to do with where the error happened.
Since we can also see CUDA errors from the GPU if we check with them, we need to make sure our code checks for them any time we interact with the GPU.
To do so, I use the following header file to define error-checking functions:
#pragma once
#include <boost/stacktrace.hpp>
#include <iostream>
#include <stdexcept>
#define STRINGIZE_DETAIL(x) #x
#define STRINGIZE(x) STRINGIZE_DETAIL(x)
#define CUDA_CHECK(call) \
do { \
if ((call) != cudaSuccess) { \
const cudaError_t err = cudaGetLastError(); \
std::cerr << "CUDA error calling \"" #call "\", code is " << err << " " << cudaGetErrorString(err) << " on " \
<< __LINE__ << "\n" \
<< boost::stacktrace::stacktrace() << std::endl; \
throw std::runtime_error("Problem."); \
} \
} while (0)
#define CUDA_KERNEL_LAUNCH_CHECK() \
do { \
const auto cuda_err = cudaGetLastError(); \
if (cuda_err != cudaSuccess) { \
throw std::runtime_error(std::string("CUDA kernel launch failed! ") + cudaGetErrorString(cuda_err)); \
} \
} while (0)
If your code looked like this beforehand:
cudaMemcpy(dst, src, 20*sizeof(float), cudaMemcpyHostToDevice);
my_kernel<<<blocks, threads, shared_mem, stream>>>(dst, result);
after adding error checking it will look like
CUDA_CHECK(cudaMemcpy(dst, src, 20*sizeof(float), cudaMemcpyHostToDevice));
my_kernel<<<blocks, threads, shared_mem, stream>>>(dst, result);
CUDA_KERNEL_LAUNCH_CHECK();
Note that you want to have a CUDA_KERNEL_LAUNCH_CHECK() immediately following every kernel launch.
Now that your code is checking for errors you'll see problems closer to where they happened.
2. Use CUDA_LAUNCH_BLOCKING
Run your program with
CUDA_LAUNCH_BLOCKING=1 ./my_program.exe
CUDA_LAUNCH_BLOCKING will cause each kernel to finish running before moving on to the next line. Since every kernel launch is followed by CUDA_KERNEL_LAUNCH_CHECK() now this will tell you exactly which kernel caused the problem.
3. Compile in debug mode
Compile your code like
nvcc -g G my.cu source.cu files.cu
This ensures that you'll have access to the CUDA source code inside the debugger and be able to step through it.
4. Use cuda-gdb
Run your code like so.
cuda-gdb --silent --ex run --args ./my_program.exe
whereas running the program without the debugger gave an address in the C++ host source:
CUDA error calling "cudaStreamSynchronize(0)", code is 700 an illegal memory access was encountered on 489
running the program in cuda-gdb gives:
CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x10000320778 (cuda.cu:414)
Thread 1 "benchmark.exe" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 15, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]
0x000001000031f450 in my_kernel<<<(1,1,1),(128,1,1)>>> (rec=0x7fffc7c00000 '\377' <repeats 200 times>, arg1=0x7fffce800000, arg2=0x7fffc7de8a00,
arg3=0x555555a7a9a0, arg4=0x7fffc7cf4400 "", arg5=1000, arg6=1000) at my_source_file.cu:382
382 const auto ci = order[i];
The debugger has identified that Line 414 is the problem - and it is!
That lines looks like this
array[i] = val;
we suspect that the assignment is causing a problem because that involves an address. But why?
Notes that all of the arguments have pointers beginning with 0x7fffc except for argument 5 which is 0x555555a7a9a0. This isn't hexadecimal garbage - it's actually the source of the problem. CUDA hasn't allocated a single block of memory at 0x555555a7a9a0 far distant from all the other addresses. Instead, a host pointer has been passed to the device and that is causing the failure. This is why the values of the pointers are so different.
Note that although the debugger appears to have stopped at
382 const auto ci = order[i];
this line has nothing to do with the problem. I'm not sure why the compiler stopped here.