Skip to content

SIMD optimisation for cross-pattern access

An answer to this question on Stack Overflow.

Question

I'm tryint to write a monte-carlo simulation of the Ising model, and I was wondering if it was possible to use SIMD optimisations for accessing data in a cross pattern.

I basically want to know if there's any way of speeding up this function.

//up/down/left/right stencil accumulation
float lattice::compute_point_energy(int row, int col) {
  int accumulator=0;
  accumulator+= get(row? row-1: size_-1, col);
  accumulator+= get((row+1)%size_, col);
  accumulator+= get(row, col? col-1: size_-1);
  accumulator+= get(row, (col+1)%size_) ;
  return -get(row, col) * (accumulator * J_ + H_);
}

get(i, j) is a method accesses a flat std::vector of shorts. I see that there might be a few problems: the access has lots of ternary logic going on (for periodic boundary conditions), and none of the vector elements are adjacent. Is it to make SIMD optimisations for this chunk, or should I keep digging? Re-implementing the adjacency matrix and/or using a different container (e.g. an array, or vector of different type) are an option.

Answer

SIMD is the last thing you'll want to try with this function.

I think you're trying to use an up/down/left/right 4-stencil for your computation. If so, your code should have a comment noting this.

You're losing a lot of speed in this function because of the potential for branching at your ternary operators and because modulus is relatively slow.

You'd do well to surround the two-dimensional space you're operating over with a ring of cells set to values appropriate for handling edge effects. This allows you to eliminate checks for edge effects.

For accessing your stencil, I find it often works to use something like the following:

const int width  = 10;
const int height = 10;
const int offset[4] = {-1,1,-width,width};
double accumulator=0;
for(int i=0;i<4;i++)
  accumulator += get(current_loc+offset[i]);

Notice that the mini-array has precalculated offsets to the neighbouring cells in your domain. A good compiler will likely unroll the foregoing loop.

Once you've done all this, appropriate choice of optimization flags may lead to automatic vectorization.

As it is, the branching and mods in your code are likely preventing auto-vectorization. You can check this by enabling appropriate flags. For Intel Compiler Collection (icc), you'll want:

-qopt-report=5 -qopt-report-phase:vec

For GCC you'll want (if I recall correctly):

-fopt-info-vec -fopt-info-missed