Why is having a single process sort a list so much faster than having many processes sort separate lists of equal size?
An answer to this question on Stack Overflow.
Question
I have 64 cores on a single machine running sorting on a total of 1GB of data. They each sort 156,250 items and shouldn't be sharing any data structures (i.e., there are 64 separate arrays being sorted in total). However, the more cores I have running, the slower each core is at its own sorting task.
The time measurement is being done as such:
void sort_ranges(std::vector<std::vector<std::vector<int> > > &range_partitions, int num_workers, std::string filename, std::string outfile)
{
    #pragma omp parallel default(none) shared(range_partitions, outfile, num_workers)
    {
        int i = omp_get_thread_num();
        std::vector<int> data_vec; // Data copied into a separate data structure for each thread
        for (int x = 0; x < num_workers; x++) {
            data_vec.reserve(data_vec.size() + range_partitions[x][i].size());
            data_vec.insert(data_vec.end(), range_partitions[x][i].begin(), range_partitions[x][i].end());
        }
        int n = data_vec.size();
        int *data = &data_vec[0];

        double start = omp_get_wtime();
        std::sort(data, data + n); // Measure the sort function call
        double sort_done = omp_get_wtime() - start;
    }
}
When I run on 1GB of data, each process sorts a 156,250-element array and takes about 10 seconds. Obviously this is ridiculously slow. If I run one process that sorts a 156,250-element array, it takes < 0.1 seconds.
I'm really confused by this, because each process is sorting a different array, so there's no reason that having more cores run identical, independent tasks should slow any of them down.
I think there's something about how memory is managed that I'm missing. Any help is appreciated!
I realize there are many costs to increased parallelism, e.g. process overhead or working on shared memory. However, I am specifically concerned with the slowdown of the std::sort() call itself, which operates on a separate data structure in each thread.
Answer
You didn't include a minimal working example (MWE) with your question, so I was not able to reproduce your issue.
I agree with the other commenters that what you're likely seeing is cache thrashing: using too many cores for the sort leaves them competing for the same cache. I haven't been able to prove that with my own tests, though.
When the CPU reads data from memory, it doesn't read just one byte: it fetches an entire cache line (typically 64 bytes) at a time. These lines are stored in caches for quick access. The caches are hierarchical and shared between cores to greater or lesser degrees, like so:
[![Cache hierarchy][1]][1]
As you can see, the cores all share the L3 cache. If the memory addresses the cores are operating on are distant from each other, the cores will have limited cache overlap and compete to utilize the cache.
Verifying whether or not this is happening in your code is easy (at least, if you have Linux). You can use the perf command to gather data about what your program is doing.
At the bottom of this question, I include a MWE of what I think you're asking about. I then gather statistics about the behaviour of the MWE using the following perf command.
perf stat -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-loads,L1-dcache-stores,l2_rqsts.miss,LLC-load-misses,LLC-loads,LLC-prefetch-misses,LLC-store-misses,LLC-stores ./a.out m
This results in the following for single-threaded operation:
18,676,838 cache-misses # 69.492 % of all cache refs (27.28%)
26,876,349 cache-references (36.38%)
143,224,257 L1-dcache-load-misses # 1.65% of all L1-dcache hits (36.39%)
8,682,532,168 L1-dcache-loads (36.40%)
4,130,005,905 L1-dcache-stores (36.40%)
92,709,572 l2_rqsts.miss (36.40%)
2,409,977 LLC-load-misses # 34.83% of all LL-cache hits (36.39%)
6,919,668 LLC-loads (36.37%)
23,562,449 LLC-prefetch-misses (18.16%)
16,038,395 LLC-store-misses (18.19%)
79,580,399 LLC-stores (18.18%)
24.578381342 seconds time elapsed
And for running with four threads:
21,357,447 cache-misses # 74.720 % of all cache refs (23.99%)
28,583,269 cache-references (33.10%)
160,265,596 L1-dcache-load-misses # 1.85% of all L1-dcache hits (35.91%)
8,670,516,235 L1-dcache-loads (36.52%)
4,131,943,678 L1-dcache-stores (36.50%)
102,495,289 l2_rqsts.miss (36.50%)
2,768,956 LLC-load-misses # 38.05% of all LL-cache hits (32.91%)
7,277,568 LLC-loads (31.23%)
29,220,858 LLC-prefetch-misses (15.36%)
18,920,533 LLC-store-misses (15.26%)
104,834,221 LLC-stores (14.85%)
10.334248457 seconds time elapsed
As you can see, running with four threads did result in more cache misses for me. This may not be a statistically significant increase; I didn't repeat the runs (perf stat -r would do this) to check. However, unlike you, I see improved performance with more threads.
To simulate cache contention, I can oversubscribe my CPU by using more threads than cores. To do so, I set the OMP_NUM_THREADS environment variable:
export OMP_NUM_THREADS=32
With 32 threads, I see:
Performance counter stats for './a.out m':
24,222,105 cache-misses # 77.175 % of all cache refs (23.39%)
31,385,891 cache-references (32.47%)
161,353,805 L1-dcache-load-misses # 1.87% of all L1-dcache hits (35.27%)
8,618,074,931 L1-dcache-loads (36.70%)
4,131,633,620 L1-dcache-stores (36.28%)
107,094,632 l2_rqsts.miss (36.21%)
5,299,670 LLC-load-misses # 56.36% of all LL-cache hits (31.93%)
9,403,090 LLC-loads (29.02%)
46,500,188 LLC-prefetch-misses (15.09%)
20,131,861 LLC-store-misses (14.26%)
105,310,438 LLC-stores (14.15%)
10.379022550 seconds time elapsed
Notice that our LLC-load-misses (last level cache) has gone from 34% to 38% to 56% as the number of threads increases. Speed, however, is not much affected. This may be because the data doesn't have good cache locality to begin with.
Regardless, this is one way to study your problem. If you want better help than this, you'll have to make an MWE of your own.
You can alleviate some of the cache contention by reducing the number of threads you are using and pinning them (setting their affinity) so that threads do not share the same L2/L3 caches, depending on your processor's topology. OpenMP exposes this through the OMP_PROC_BIND and OMP_PLACES environment variables.
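For example, with OpenMP 4.0 and later you can control thread placement entirely through environment variables. The best settings depend on your machine's topology; this is just one plausible configuration:

```shell
export OMP_NUM_THREADS=4     # use fewer threads than cores
export OMP_PROC_BIND=spread  # place threads as far apart as possible
export OMP_PLACES=cores      # bind each thread to its own physical core
./a.out m
```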
Minimal Working Example
#include <algorithm>
#include <chrono>
#include <iostream>
#include <random>
#include <vector>

typedef std::vector< std::vector<int> > data_t;

data_t GenData(std::mt19937 &mt_rand, unsigned int vec_num, unsigned int vec_len){
  data_t data;
  data.reserve(vec_num);
  for(unsigned int i=0;i<vec_num;i++){
    data.emplace_back();
    data.back().reserve(vec_len);
    for(unsigned int j=0;j<vec_len;j++)
      data.back().emplace_back(mt_rand());
  }
  return data;
}

void SortSingle(data_t &data){
  for(auto &v: data)
    std::sort(v.begin(),v.end());
}

void SortMulti(data_t &data){
  #pragma omp parallel for default(none) shared(data)
  for(unsigned int i=0;i<data.size();i++)
    std::sort(data[i].begin(), data[i].end());
}

int main(int argc, char **argv){
  if(argc<2){
    std::cout<<"Usage: "<<argv[0]<<" <s|m>"<<std::endl;
    return 1;
  }

  std::mt19937 mt_rand;
  typedef std::chrono::high_resolution_clock clock;

  std::cout<<"Generating data..."<<std::endl;
  auto data = GenData(mt_rand,1600,156250); //1600 vectors of 156,250 ints: ~1GB

  std::cout<<"Sorting data..."<<std::endl;
  const auto start_time = clock::now();
  if(argv[1][0]=='s')
    SortSingle(data);
  else if(argv[1][0]=='m')
    SortMulti(data);
  else
    std::cout<<"Unknown sort type!"<<std::endl;
  const auto end_time = clock::now();

  const auto time_diff = std::chrono::duration_cast<std::chrono::duration<double>>(end_time - start_time).count();
  std::cout<<"Time = "<<time_diff<<"s"<<std::endl;

  return 0;
}
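The MWE needs OpenMP enabled at compile time. Assuming the source is saved as sort_mwe.cpp (the filename is my choice), you can build and run it like so:

```shell
g++ -O3 -fopenmp sort_mwe.cpp -o a.out
./a.out s   # single-threaded sort
./a.out m   # OpenMP parallel sort
```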
[1]: https://i.sstatic.net/rCL9r.jpg