
Faster way to read/write a std::unordered_map from/to a file

An answer to this question on Stack Overflow.

Question

I am working with some very large std::unordered_maps (hundreds of millions of entries) and need to save and load them to and from a file. The way I am currently doing this is by iterating through the map and reading/writing each key and value pair one at a time:

std::unordered_map<unsigned long long int, char> map;
void save(){
    std::unordered_map<unsigned long long int, char>::iterator iter;
    FILE *f = fopen("map", "wb");
    for(iter=map.begin(); iter!=map.end(); iter++){
        fwrite(&(iter->first), 8, 1, f);
        fwrite(&(iter->second), 1, 1, f);
    }
    fclose(f);
}
void load(){
    FILE *f = fopen("map", "rb");
    unsigned long long int key;
    char val;
    while(fread(&key, 8, 1, f)){
        fread(&val, 1, 1, f);
        map[key] = val;
    }
    fclose(f);
}

But with around 624 million entries, reading the map from a file took 9 minutes. Writing to a file was faster but still took several minutes. Is there a faster way to do this?

Answer

(Edit: I've added a new answer to this question which achieves a 95% decrease in wall-times.)

I made a Minimal Working Example (MWE) that illustrates the problem you are trying to solve. This is something you should always do in your questions.

I then eliminated the unsigned long long int stuff and replaced it with uint64_t from the &lt;cstdint&gt; header. This ensures that we are operating on the same data size everywhere, since unsigned long long int is only guaranteed to be at least 64 bits wide; its exact size depends on the computer/compiler you use.

The resulting MWE looks like:

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <deque>
#include <functional>
#include <iostream>
#include <limits>
#include <random>
#include <unordered_map>
#include <vector>
typedef std::unordered_map<uint64_t, char> table_t;
const int TEST_TABLE_SIZE = 10000000;
void Save(const table_t &map){
  std::cout<<"Save. ";
  const auto start = std::chrono::steady_clock::now();
  FILE *f = fopen("/z/map", "wb");
  for(auto iter=map.begin(); iter!=map.end(); iter++){
      fwrite(&(iter->first), 8, 1, f);
      fwrite(&(iter->second), 1, 1, f);
  }
  fclose(f);
  const auto end = std::chrono::steady_clock::now();
  std::cout<<"Save time = "<< std::chrono::duration<double, std::milli> (end-start).count() << " ms" << std::endl;
}
//Take advantage of the limited range of values to save time
void SaveLookup(const table_t &map){
  std::cout<<"SaveLookup. ";
  const auto start = std::chrono::steady_clock::now();
  //Create a lookup table
  std::vector< std::deque<uint64_t> > lookup(256);
  for(auto &kv: map)
    lookup.at(kv.second+128).emplace_back(kv.first);
  //Save lookup table header
  FILE *f = fopen("/z/map", "wb");
  for(const auto &row: lookup){
    const uint32_t rowsize = row.size();
    fwrite(&rowsize, 4, 1, f);
  }
  //Save values
  for(const auto &row: lookup)
  for(const auto &val: row)
    fwrite(&val, 8, 1, f);
  
  fclose(f);
  const auto end = std::chrono::steady_clock::now();
  std::cout<<"Save time = "<< std::chrono::duration<double, std::milli> (end-start).count() << " ms" << std::endl;
}
//Take advantage of the limited range of values and contiguous memory to
//save time
void SaveLookupVector(const table_t &map){
  std::cout<<"SaveLookupVector. ";
  const auto start = std::chrono::steady_clock::now();
  //Create a lookup table
  std::vector< std::vector<uint64_t> > lookup(256);
  for(auto &kv: map)
    lookup.at(kv.second+128).emplace_back(kv.first);
  //Save lookup table header
  FILE *f = fopen("/z/map", "wb");
  for(const auto &row: lookup){
    const uint32_t rowsize = row.size();
    fwrite(&rowsize, 4, 1, f);
  }
  //Save values
  for(const auto &row: lookup)
    fwrite(row.data(), 8, row.size(), f);
  
  fclose(f);
  const auto end = std::chrono::steady_clock::now();
  std::cout<<"Save time = "<< std::chrono::duration<double, std::milli> (end-start).count() << " ms" << std::endl;
}
void Load(table_t &map){
  std::cout<<"Load. ";
  const auto start = std::chrono::steady_clock::now();
  FILE *f = fopen("/z/map", "rb");
  uint64_t key;
  char val;
  while(fread(&key, 8, 1, f)){
      fread(&val, 1, 1, f);
      map[key] = val;
  }
  fclose(f);
  const auto end = std::chrono::steady_clock::now();
  std::cout<<"Load time = "<< std::chrono::duration<double, std::milli> (end-start).count() << " ms" << std::endl;
}
void Load2(table_t &map){
  std::cout<<"Load with Reserve. ";
  map.reserve(TEST_TABLE_SIZE+TEST_TABLE_SIZE/8);
  const auto start = std::chrono::steady_clock::now();
  FILE *f = fopen("/z/map", "rb");
  uint64_t key;
  char val;
  while(fread(&key, 8, 1, f)){
      fread(&val, 1, 1, f);
      map[key] = val;
  }
  fclose(f);
  const auto end = std::chrono::steady_clock::now();
  std::cout<<"Load time = "<< std::chrono::duration<double, std::milli> (end-start).count() << " ms" << std::endl;
}
//Take advantage of the limited range of values to save time
void LoadLookup(table_t &map){
  std::cout<<"LoadLookup. ";
  map.reserve(TEST_TABLE_SIZE+TEST_TABLE_SIZE/8);
  const auto start = std::chrono::steady_clock::now();
  FILE *f = fopen("/z/map", "rb");
  //Read the header
  std::vector<uint32_t> inpsizes(256);
  for(int i=0;i<256;i++)
    fread(&inpsizes[i], 4, 1, f);
  uint64_t key;
  for(int i=0;i<256;i++){
    const char val = i-128;    
    for(uint32_t v=0;v<inpsizes.at(i);v++){
      fread(&key, 8, 1, f);
      map[key] = val;
    }
  }
  fclose(f);
  const auto end = std::chrono::steady_clock::now();
  std::cout<<"Load time = "<< std::chrono::duration<double, std::milli> (end-start).count() << " ms" << std::endl;
}
//Take advantage of the limited range of values and contiguous memory to save time
void LoadLookupVector(table_t &map){
  std::cout<<"LoadLookupVector. ";
  map.reserve(TEST_TABLE_SIZE+TEST_TABLE_SIZE/8);
  const auto start = std::chrono::steady_clock::now();
  FILE *f = fopen("/z/map", "rb");
  //Read the header
  std::vector<uint32_t> inpsizes(256);
  for(int i=0;i<256;i++)
    fread(&inpsizes[i], 4, 1, f);
  for(int i=0;i<256;i++){
    const char val = i-128;    
    std::vector<uint64_t> keys(inpsizes[i]);
    fread(keys.data(), 8, inpsizes[i], f);
    for(const auto &key: keys)
      map[key] = val;
  }
  fclose(f);
  const auto end = std::chrono::steady_clock::now();
  std::cout<<"Load time = "<< std::chrono::duration<double, std::milli> (end-start).count() << " ms" << std::endl;
}
int main(){
  //Perfectly horrendous way of seeding a PRNG, but we'll do it here for brevity
  auto generator = std::mt19937(12345); //Combination of my luggage
  //Generate values within the specified closed intervals
  auto key_rand  = std::bind(std::uniform_int_distribution<uint64_t>(0,std::numeric_limits<uint64_t>::max()), generator);
  auto val_rand  = std::bind(std::uniform_int_distribution<int>(std::numeric_limits<char>::lowest(),std::numeric_limits<char>::max()), generator);
  std::cout<<"Generating test data..."<<std::endl;
  //Generate a test table
  table_t map;
  for(int i=0;i<TEST_TABLE_SIZE;i++)
    map[key_rand()] = (char)val_rand(); //Low chance of collisions, so we get quite close to the desired size
  Save(map);
  { table_t map2; Load (map2); }
  { table_t map2; Load2(map2); }
  SaveLookup(map);
  SaveLookupVector(map);
  { table_t map2; LoadLookup      (map2); }
  { table_t map2; LoadLookupVector(map2); }
}

On the test data set I use, this gives a write time of 1982 ms and a read time (using your original code) of 7467 ms. Since the read time seemed to be the biggest bottleneck, I created a new function Load2 which reserves sufficient space for the unordered_map prior to reading. This dropped the read time to 4700 ms (a 37% saving).

Edit 1

Now, I note that the values of your unordered_map can only take 256 distinct values (those of a char). Thus, I can easily convert the unordered_map into a kind of lookup table in RAM. That is, rather than having:

123123 1
234234 0
345345 1
237872 1

I can rearrange the data to look like:

0 234234
1 123123 345345 237872

What's the advantage of this? It means that I no longer have to write the value to disk: it is implied by which row a key sits in. That saves 1 byte per table entry. Since each table entry consists of 8 bytes for the key and 1 byte for the value, this should give an 11% saving in both read and write time, minus the cost of rearranging the memory (which I expect to be low, because RAM).

Finally, once I've done the above rearrangement, if I have a lot of spare RAM on the machine, I can pack everything into a vector and read/write the contiguous data to disk.

Doing all this gives the following times:

Save. Save time = 1836.52 ms
Load. Load time = 7114.93 ms
Load with Reserve. Load time = 4277.58 ms
SaveLookup. Save time = 1688.73 ms
SaveLookupVector. Save time = 1394.95 ms
LoadLookup. Load time = 3927.3 ms
LoadLookupVector. Load time = 3739.37 ms

Note that the transition from Save to SaveLookup gives an 8% speed-up and the transition from Load with Reserve to LoadLookup gives an 8% speed-up as well. This is right in line with our theory!

Using contiguous memory as well gives a total speed-up of 24% over your original save time and a total speed-up of 47% over your original load time.