Why is having a single process sort a list so much faster than having many processes sort separate lists of equal size?
An answer to this question on Stack Overflow.
Question
I have 64 cores on a single machine running sorting on a total of 1GB of data. They each sort 156,250 items and shouldn't be sharing any data structures (i.e., there are 64 separate arrays being sorted in total). However, the more cores I have running, the slower each core is at its own sorting task.
The time measurement is being done as such:
void sort_ranges(std::vector<std::vector<std::vector<int> > > &range_partitions, int num_workers, std::string filename, std::string outfile)
{
    #pragma omp parallel default(none) shared(range_partitions, outfile, num_workers)
    {
        int i = omp_get_thread_num();
        std::vector<int> data_vec; // Data copied into a separate data structure for each thread
        for (int x = 0; x < num_workers; x++) {
            data_vec.reserve(data_vec.size() + range_partitions[x][i].size());
            data_vec.insert(data_vec.end(), range_partitions[x][i].begin(), range_partitions[x][i].end());
        }
        int n = data_vec.size();
        int *data = &data_vec[0];

        double start = omp_get_wtime();
        std::sort(data, data + n); // Measure the sort function call
        double sort_done = omp_get_wtime() - start;
    }
}
When I run on 1GB of data, each process sorts a 156,250-element array and takes about 10 seconds. Obviously this is ridiculously slow. If I run one process that sorts a 156,250-element array, it takes < 0.1 seconds.
I'm really confused by this, because each process is sorting a different array, so there's no reason that having more cores run identical, independent tasks should slow any of them down.
I think there's something about how memory is managed that I'm missing. Any help is appreciated!
I realize there are many costs to increased parallelism, e.g. process overhead or working on shared memory. However, I am specifically concerned with the slowdown of the std::sort() call itself, which operates on a separate data structure in each thread.
Answer
You didn't include a minimal working example (MWE) with your question, so I was not able to reproduce your issue.
I agree with the other commenters that what you're likely seeing is cache thrashing: using too many cores for the sort leaves them competing for the same cache. I haven't been able to prove that with my own tests, though.
When the CPU reads data from memory, it doesn't read just one byte: it fetches an entire cache line (typically 64 bytes) at a time. These lines are stored in caches for quick access. The caches are hierarchical and shared between cores to greater or lesser degrees, like so:
[![Cache hierarchy][1]][1]
As you can see, the cores all share the L3 cache. If the memory addresses the cores are operating on are distant from each other, the cores will have limited cache overlap and compete to utilize the cache.
Verifying whether or not this is happening in your code is easy (at least, if you have Linux). You can use the perf command to gather data about what your program is doing.
At the bottom of this question, I include a MWE of what I think you're asking about. I then gather statistics about the behaviour of the MWE using the following perf command.
perf stat -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-loads,L1-dcache-stores,l2_rqsts.miss,LLC-load-misses,LLC-loads,LLC-prefetch-misses,LLC-store-misses,LLC-stores ./a.out m
This results in the following for single-threaded operation:
18,676,838 cache-misses # 69.492 % of all cache refs (27.28%)
26,876,349 cache-references (36.38%)
143,224,257 L1-dcache-load-misses # 1.65% of all L1-dcache hits (36.39%)
8,682,532,168 L1-dcache-loads (36.40%)
4,130,005,905 L1-dcache-stores (36.40%)
92,709,572 l2_rqsts.miss (36.40%)
2,409,977 LLC-load-misses # 34.83% of all LL-cache hits (36.39%)
6,919,668 LLC-loads (36.37%)
23,562,449 LLC-prefetch-misses (18.16%)
16,038,395 LLC-store-misses (18.19%)
79,580,399 LLC-stores (18.18%)
24.578381342 seconds time elapsed
And for running with four threads:
21,357,447 cache-misses # 74.720 % of all cache refs (23.99%)
28,583,269 cache-references (33.10%)
160,265,596 L1-dcache-load-misses # 1.85% of all L1-dcache hits (35.91%)
8,670,516,235 L1-dcache-loads (36.52%)
4,131,943,678 L1-dcache-stores (36.50%)
102,495,289 l2_rqsts.miss (36.50%)
2,768,956 LLC-load-misses # 38.05% of all LL-cache hits (32.91%)
7,277,568 LLC-loads (31.23%)
29,220,858 LLC-prefetch-misses (15.36%)
18,920,533 LLC-store-misses (15.26%)
104,834,221 LLC-stores (14.85%)
10.334248457 seconds time elapsed
As you can see, running with four threads did result in more cache misses for me. This may not be a statistically significant increase; I didn't repeat the runs (perf stat -r would do this) to check. However, unlike you, I see improved performance with more threads.
To simulate cache contention, I can oversubscribe my CPU by using more threads than cores. To do so, I set the OMP_NUM_THREADS environment variable:
export OMP_NUM_THREADS=32
With 32 threads, I see:
Performance counter stats for './a.out m':
24,222,105 cache-misses # 77.175 % of all cache refs (23.39%)
31,385,891 cache-references (32.47%)
161,353,805 L1-dcache-load-misses # 1.87% of all L1-dcache hits (35.27%)
8,618,074,931 L1-dcache-loads (36.70%)
4,131,633,620 L1-dcache-stores (36.28%)
107,094,632 l2_rqsts.miss (36.21%)
5,299,670 LLC-load-misses # 56.36% of all LL-cache hits (31.93%)
9,403,090 LLC-loads (29.02%)
46,500,188 LLC-prefetch-misses (15.09%)
20,131,861 LLC-store-misses (14.26%)
105,310,438 LLC-stores (14.15%)
10.379022550 seconds time elapsed
Notice that our LLC-load-misses (last level cache) has gone from 34% to 38% to 56% as the number of threads increases. Speed, however, is not much affected. This may be because the data doesn't have good cache locality to begin with.
Regardless, this is one way to study your problem. If you want better help than this, you'll have to make an MWE of your own.
You can alleviate some of the cache contention by reducing the number of threads you are using and pinning them (setting their affinity) so that threads do not share the same L2/L3 caches, depending on your processor's topology. OpenMP exposes this through the OMP_PROC_BIND and OMP_PLACES environment variables.
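For example, with OpenMP 4.0 and later you can control thread placement entirely through environment variables. The best settings depend on your machine's topology; this is just one plausible configuration:

```shell
export OMP_NUM_THREADS=4     # use fewer threads than cores
export OMP_PROC_BIND=spread  # place threads as far apart as possible
export OMP_PLACES=cores      # bind each thread to its own physical core
./a.out m
```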
Minimal Working Example
#include <algorithm>
#include <chrono>
#include <iostream>
#include <random>
#include <vector>

typedef std::vector< std::vector<int> > data_t;

data_t GenData(std::mt19937 &mt_rand, unsigned int vec_num, unsigned int vec_len){
  data_t data;
  data.reserve(vec_num);
  for(unsigned int i=0;i<vec_num;i++){
    data.emplace_back();
    data.back().reserve(vec_len);
    for(unsigned int j=0;j<vec_len;j++)
      data.back().emplace_back(mt_rand());
  }
  return data;
}

void SortSingle(data_t &data){
  for(auto &v: data)
    std::sort(v.begin(),v.end());
}

void SortMulti(data_t &data){
  #pragma omp parallel for default(none) shared(data)
  for(unsigned int i=0;i<data.size();i++)
    std::sort(data[i].begin(), data[i].end());
}

int main(int argc, char **argv){
  if(argc<2){
    std::cout<<"Usage: "<<argv[0]<<" <s|m>"<<std::endl;
    return 1;
  }

  std::mt19937 mt_rand;
  typedef std::chrono::high_resolution_clock clock;

  std::cout<<"Generating data..."<<std::endl;
  auto data = GenData(mt_rand,1600,156250); //1600 vectors of 156,250 ints: ~1GB

  std::cout<<"Sorting data..."<<std::endl;
  const auto start_time = clock::now();
  if(argv[1][0]=='s')
    SortSingle(data);
  else if(argv[1][0]=='m')
    SortMulti(data);
  else
    std::cout<<"Unknown sort type!"<<std::endl;
  const auto end_time = clock::now();

  const auto time_diff = std::chrono::duration_cast<std::chrono::duration<double>>(end_time - start_time).count();
  std::cout<<"Time = "<<time_diff<<"s"<<std::endl;

  return 0;
}
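The MWE needs OpenMP enabled at compile time. Assuming the source is saved as sort_mwe.cpp (the filename is my choice), you can build and run it like so:

```shell
g++ -O3 -fopenmp sort_mwe.cpp -o a.out
./a.out s   # single-threaded sort
./a.out m   # OpenMP parallel sort
```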
[1]: https://i.sstatic.net/rCL9r.jpg