C++ parallel average matrix with OpenMP
An answer to this question on Stack Overflow.
Question
I have this code in C++ that calculates the average of each column of a matrix. I want to parallelize the code using OpenMP.
#include <vector>
#include <cstdlib>
#include <chrono>
#include <iostream>

using namespace std;

vector<double> average(const vector<vector<unsigned char>>& original){
  vector<vector<double>> result(original.size(), vector<double>(original[0].size()));
  vector<double> average(original[0].size(), 0.0);
  for (int i=0; i<original.size(); i++) {
    const vector<unsigned char>& vector = original[i];
    for (int k = 0; k < vector.size(); ++k) {
      average[k] += vector[k];
    }
  }
  for (double& val : average) {
    val /= original.size();
  }
  return average;
}
Adding #pragma omp parallel for before the outer for loop gets me bogus results. Do you have any pointers? I thought I would find tons of examples of this online but wasn't able to find much. This is my first time using OpenMP.
Answer
Frank is right in saying that your immediate problem may be that you're using a non-atomic operation:
average[k] += vector[k];
You can fix that by using:
#pragma omp atomic
average[k] += vector[k];
But a larger conceptual problem is that this probably will not speed up your code: the operations you are performing are very simple, and your memory (at least within each row) is already contiguous.
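As an aside, a standard way to avoid per-element atomics altogether is to give each thread a private partial-sum buffer and merge the buffers once per thread at the end. A minimal sketch of that idea (the function and variable names are mine, not from your code):

```cpp
#include <cstddef>
#include <vector>

// Column averages with per-thread accumulators: each thread sums its share
// of the rows into a private buffer, then the buffers are merged inside a
// critical section -- one merge per thread, not one atomic per element.
std::vector<double> column_averages(const std::vector<std::vector<unsigned char>>& m) {
  const std::size_t cols = m[0].size();
  std::vector<double> avg(cols, 0.0);
  #pragma omp parallel
  {
    std::vector<double> local(cols, 0.0);  // private to this thread
    #pragma omp for nowait
    for (long long i = 0; i < (long long)m.size(); ++i)
      for (std::size_t k = 0; k < cols; ++k)
        local[k] += m[i][k];
    #pragma omp critical
    for (std::size_t k = 0; k < cols; ++k)
      avg[k] += local[k];
  }
  for (double& v : avg)
    v /= m.size();
  return avg;
}
```

Whether this beats the atomic version depends on the matrix shape and the thread count, but the synchronization cost is proportional to threads × columns rather than to the total number of elements.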
Indeed, I have made a Minimal Working Example from your code (you should have included one in your question):
#include <vector>
#include <cstdlib>
#include <chrono>
#include <iostream>

using namespace std;

vector<float> average(const vector<vector<unsigned char>>& original){
  vector<float> average(original[0].size(), 0.0);
  #pragma omp parallel for
  for (int i=0; i<original.size(); i++) {
    const vector<unsigned char>& vector = original[i];
    for (int k = 0; k < vector.size(); ++k) {
      #pragma omp atomic
      average[k] += vector[k];
    }
  }
  for (float& val : average) {
    val /= original.size();
  }
  return average;
}

int main(){
  vector<vector<unsigned char>> mat(1000);
  for(int y=0;y<mat.size();y++)
    for(int x=0;x<mat.size();x++)
      mat.at(y).emplace_back(rand()%255);

  std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();

  double dont_optimize = 0;
  for(int i=0;i<100;i++){
    auto ret = average(mat);
    dont_optimize += ret[0];
  }

  std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
  std::cout<<"Time = "<<(std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count()/100)<<std::endl;
  std::cerr<<dont_optimize<<std::endl; //Ensure the timed loop cannot be optimized away

  return 0;
}
Compiling this with g++ -O3 temp.cpp -fopenmp enables OpenMP. Run-times on my four-core machine are consistently about 10,247 microseconds. When I disable OpenMP, run-times are about 2,561 microseconds.
Starting and managing a thread team is expensive.
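You can see that cost directly by timing an otherwise-empty parallel region. A rough sketch (the helper name is mine; absolute numbers will vary by machine and runtime):

```cpp
#include <chrono>

// Average cost, in nanoseconds, of forking and joining an (almost) empty
// OpenMP parallel region `reps` times. Compiled without -fopenmp the pragma
// is ignored and this measures nearly nothing -- which is exactly the point:
// the overhead comes from creating and synchronizing the thread team.
long long fork_join_cost_ns(int reps) {
  volatile int sink = 0;  // trivial body so the region is not elided
  auto begin = std::chrono::steady_clock::now();
  for (int rep = 0; rep < reps; ++rep) {
    #pragma omp parallel
    { sink = sink + 1; }  // benign race; we only care about fork/join time
  }
  auto end = std::chrono::steady_clock::now();
  return std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count() / reps;
}
```

On typical setups each fork/join costs on the order of microseconds, which is already a noticeable fraction of the ~2,500-microsecond serial run-time above.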
But there is a real way to speed up your code: improve your memory layout.
Using a std::vector<std::vector<T>> design means that each inner vector<T> can be located anywhere in memory. Instead, we would like all of our memory in one nice, contiguous block. We can achieve this with flat-array indexing, like so:
(Note that an earlier version of the below code used, e.g., mat.at(y*width+x). The range checking this implies results in a significant loss of speed versus using mat[y*width+x], as the code now does. Times have been updated appropriately.)
#include <vector>
#include <cstdlib>
#include <chrono>
#include <iostream>

using namespace std;

class Matrix {
 public:
  vector<unsigned char> mat;
  int width;
  int height;
  Matrix(int width0, int height0){
    width  = width0;
    height = height0;
    for(int i=0;i<width*height;i++)
      mat.emplace_back(rand()%255);
  }
  unsigned char& operator()(int x, int y){
    return mat[y*width+x];
  }
  unsigned char operator()(int x, int y) const {
    return mat[y*width+x];
  }
  unsigned char& operator()(int i){
    return mat[i];
  }
  unsigned char operator()(int i) const {
    return mat[i];
  }
};

vector<float> average(const Matrix& original){
  vector<float> average(original.width, 0.0);
  #pragma omp parallel for
  for(int y=0;y<original.height;y++)
    for(int x=0;x<original.width;x++)
      #pragma omp atomic
      average[x] += original(x,y);
  for (float& val : average)
    val /= original.height;
  return average;
}

int main(){
  Matrix mat(1000,1000);
  std::cerr<<mat.width<<" "<<mat.height<<std::endl;

  std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();

  double dont_optimize = 0;
  for(int i=0;i<100;i++){
    auto ret = average(mat);
    dont_optimize += ret[0];
  }

  std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
  std::cout<<"Time = "<<(std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count()/100)<<std::endl;
  std::cerr<<dont_optimize<<std::endl; //Ensure the timed loop cannot be optimized away

  return 0;
}
Note that I am also using float instead of double: you can cram twice as many numbers into the same amount of space this way, which is good for caching.
This gives run-times of 292 microseconds without OpenMP and 9426 microseconds with OpenMP.
In conclusion, using OpenMP/parallelism slows down your code here because the work being done in parallel takes less time than setting up the parallelism itself, while using a better memory layout cuts the run-time by roughly 90% (about a 9x speed-up). In addition, the handy Matrix class I created improves your code's readability and maintainability.
Edit
Running this on matrices that are 10,000x10,000 instead of 1,000x1,000 gives similar results. For the vector of vectors: 7,449 microseconds without OpenMP and 156,316 microseconds with OpenMP. For flat-array indexing: 32,668 microseconds without OpenMP and 145,470 microseconds with OpenMP.
The performance may be related to the hardware instruction set available on my machine (particularly, if my machine lacks atomic instructions then OpenMP will have to simulate them with mutexes and such). Indeed, in the flat-array example compiling with -march=native gives improved, though still not-great, performance for OpenMP: 33,079 microseconds without OpenMP and 127,841 microseconds with OpenMP. I will experiment with a more powerful machine later.
Edit
While the aforementioned testing was performed on the Intel(R) Core(TM) i5 CPU M 480 @ 2.67GHz, I have compiled this code (with -O3 -march=native) on the badass Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz. The results are similar:
- 1000x1000, Vector of vectors, without OpenMP: 145μs
- 1000x1000, Vector of vectors, with OpenMP: 2,941μs
- 10000x10000, Vector of vectors, without OpenMP: 20,254μs
- 10000x10000, Vector of vectors, with OpenMP: 85,703μs
- 1000x1000, Flat array, without OpenMP: 139μs
- 1000x1000, Flat array, with OpenMP: 3,171μs
- 10000x10000, Flat array, without OpenMP: 18,712μs
- 10000x10000, Flat array, with OpenMP: 89,097μs
This confirms our earlier result: using OpenMP for this task tends to slow things down, even if your hardware is amazing. In fact, most of the speed-up between the two processors is likely due to the Xeon's large L3 cache: at 30,720K it's 10x larger than the 3,072K cache on the i5.
Edit
Incorporating Zulan's reduction strategy from their answer below allows us to efficiently leverage parallelism:
vector<float> average(const Matrix& original){
  vector<float> average(original.width, 0.0);
  auto average_data = average.data();
  #pragma omp parallel for reduction(+ : average_data[:original.width])
  for(int y=0;y<original.height;y++){
    for(int x=0;x<original.width;x++)
      average_data[x] += original(x,y);
  }
  for (float& val : average)
    val /= original.height;
  return average;
}
For 24 threads, this gives 2,629 microseconds on the 10,000x10,000 arrays: a 7.1x improvement over the serial version. Using Zulan's strategy on your original vector-of-vectors layout gives 3,529 microseconds, so the better memory layout still shaves about 25% off the run-time.
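One caveat: array-section reductions like `reduction(+ : average_data[:original.width])` require OpenMP 4.5 support (GCC 6 or later; the `_OPENMP` macro expands to 201511 or higher for 4.5). A small helper for checking how many threads your build will actually use (the function name is mine):

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// Number of threads OpenMP would use for the next parallel region, or 1 if
// the code was compiled without OpenMP. Override at run time with, e.g.,
// OMP_NUM_THREADS=24.
int max_threads() {
#ifdef _OPENMP
  return omp_get_max_threads();
#else
  return 1;  // no OpenMP: everything runs serially
#endif
}
```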