Writing code on the CPU while developing, running it on the GPU when live - which approach?

2020-02-24

c++ gpu scicomp software software-recommendation stackexchange

An answer to this question on the Scientific Computing Stack Exchange.

Question

In my simulations I make heavy use of dense matrix-vector multiplications and 2D FFTs, for matrix sizes of 8k x 8k and up. Hence, I assume that a GPU would speed up my code.
The problem is that my development PC has no dedicated GPU and does not support adding one. Buying a new development PC that can take a GPU is currently not possible. My current approach is therefore to use ArrayFire, which lets me switch the backend depending on the library I link against, so I can write and run code on the CPU for testing and still switch to CUDA/OpenCL in production.

Still, I was wondering whether there are other, maybe better, alternatives. I looked at Kokkos, for example, but there I would have to write my own wrapper for FFTs.
Or should I rather switch to a completely different approach for solving those problems?

Edit: The code is written in C++, so I'd like to avoid switching to another language.

Answer

ArrayFire has a C++ API as well as a Python API. You can switch between several backends, including CPU, CUDA, and OpenCL, and it handles memory movement and kernel fusion for you. An example that benchmarks a 2D FFT on the selected backend:

/*******************************************************
 * Copyright (c) 2014, ArrayFire
 * All rights reserved.
 *
 * This file is distributed under 3-clause BSD license.
 * The complete license agreement can be obtained at:
 * http://arrayfire.com/licenses/BSD-3-Clause
 ********************************************************/
 
#include <arrayfire.h>
#include <math.h>
#include <stdio.h>
#include <cstdlib>
 
using namespace af;
 
// create a small wrapper to benchmark
static array A;  // populated before each timing
static void fn() {
    array B = fft2(A);  // 2D FFT of A
    B.eval();           // ensure evaluated
}
 
int main(int argc, char** argv) {
    try {
        // Choose one backend:
        setBackend(AF_BACKEND_CPU);      // develop and test on the CPU
        // setBackend(AF_BACKEND_CUDA);  // run on the GPU in production
        info();
 
        printf("Benchmark N-by-N 2D fft\n");
        for (int M = 7; M <= 12; M++) {
            int N = (1 << M);
 
            printf("%4d x %4d: ", N, N);
            A             = randu(N, N);
            double time   = timeit(fn);  // time in seconds
            double gflops = 10.0 * N * N * M / (time * 1e9);
 
            printf(" %4.0f Gflops\n", gflops);
            fflush(stdout);
        }
    } catch (af::exception& e) { fprintf(stderr, "%s\n", e.what()); }
 
    return 0;
}
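
The Gflops figure in the benchmark appears to rest on the usual 5 n log2(n) operation estimate for a complex FFT of n points. With n = N * N and N = 2^M, that estimate becomes 10 * N * N * M, which is the expression in the loop. A minimal sketch of that bookkeeping, using only the standard library; the helper names fft2_flops and fft2_gflops are hypothetical and not part of ArrayFire:

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not an ArrayFire function): estimated floating-point
// operations for one N-by-N 2D FFT, using the conventional 5 * n * log2(n)
// count with n = N * N points. For N = 2^M this equals 10 * N * N * M.
double fft2_flops(int N) {
    assert(N > 0);
    const double n = static_cast<double>(N) * N;
    return 5.0 * n * std::log2(n);
}

// Hypothetical helper: convert a measured time in seconds into Gflops.
double fft2_gflops(int N, double seconds) {
    assert(seconds > 0.0);
    return fft2_flops(N) / (seconds * 1e9);
}
```

Inside the benchmark loop, fft2_gflops(N, time) could stand in for the inline expression. Since the count is only an estimate, the resulting number is best read as a way to compare backends and sizes rather than as an exact measure of work done.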