Average calculation in Python

2018-12-23

average numpy performance python stackexchange stackoverflow

An answer to this question on Stack Overflow.

Question

I am trying to speed up a Python snippet.

Given two equal-sized NumPy arrays, a and b, the goal is to find, for each distinct value in a, the average of the values in b at the same indices. The indices of the two arrays are in sync.

For example:

a = np.array([1, 1, 1, 2, 2, 2])
b = np.array([10, 10, 10, 20, 20, 20])

There are two distinct values in a, 1 and 2. The values in b where there is a "1" in a at the same index are [10, 10, 10]. Hence average(1) is 10. Analogously, average(2) is 20.

We can assume that the set of distinct values in a is known a priori. The values in a need not be consecutive, and their order is random. I chose this example only to ease the description.

Here is how I approached it:

# Accumulate the total sum and count
for index, val in np.ndenumerate(a):
    val_to_sum[val] += b[index]
    val_to_count[val] += 1
# Calculate the mean
for val in val_to_sum.keys():
    if val_to_count[val]:  # skip vals with zero count
        val_to_mean[val] = val_to_sum[val] / val_to_count[val]

Here val_to_sum and val_to_count are dictionaries that are initialized to zeros based on the known list of values that can be seen in a (1 and 2 in this case).
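
For reference, a minimal runnable sketch of that setup could look like the following. The name known_vals is hypothetical and stands for the list of values known in advance; val_to_mean is assumed to start out empty.

```python
import numpy as np

a = np.array([1, 1, 1, 2, 2, 2])
b = np.array([10, 10, 10, 20, 20, 20])

# Hypothetical name: the values known in advance to occur in a
known_vals = [1, 2]

# One zeroed accumulator per known value
val_to_sum = {val: 0 for val in known_vals}
val_to_count = {val: 0 for val in known_vals}
val_to_mean = {}  # assumed to start empty

# Accumulate the total sum and count
for index, val in np.ndenumerate(a):
    val_to_sum[val] += b[index]
    val_to_count[val] += 1

# Calculate the mean, skipping values that never occurred
for val in val_to_sum:
    if val_to_count[val]:
        val_to_mean[val] = val_to_sum[val] / val_to_count[val]

print(val_to_mean)
```

With the example arrays, val_to_mean ends up mapping 1 to 10.0 and 2 to 20.0.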

I doubt that this is the fastest way to calculate it. I expect the arrays to be quite long, say a few million elements, and the set of possible values to be on the order of tens.

How can I speed up this computation?

Could this be the solution? Inspired by one of the answers below, this might do it:

for val in np.unique(a):  # loop over the distinct values, not every element
    val_to_mean[val] = b[a == val].mean()

Answer

Perhaps something like this would work:

import numpy as np
a = np.array([1, 1, 1, 2, 2, 2])
b = np.array([10, 10, 10, 20, 20, 20])
np.average(b[a==1])
np.average(b[a==2])
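
To make the selection step explicit, here is a small sketch of what the expression b[a==1] does. The names mask and selected are introduced only for illustration.

```python
import numpy as np

a = np.array([1, 1, 1, 2, 2, 2])
b = np.array([10, 10, 10, 20, 20, 20])

mask = a == 1        # boolean array, True where a equals 1
selected = b[mask]   # elements of b at the True positions

print(mask.tolist())         # [True, True, True, False, False, False]
print(selected.tolist())     # [10, 10, 10]
print(np.average(selected))  # 10.0
```

The comparison runs over the whole array at once, so the Python-level loop over elements disappears.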

For larger datasets:

import numpy as np
a = np.random.randint(1,30,1000000)
b = np.random.random(size=1000000)
for x in np.unique(a):  # distinct values of a, in sorted order
    print("Average for values marked {0}: {1}".format(x, np.average(b[a == x])))
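
The loop above scans the full arrays once per distinct value. If that becomes a bottleneck, one possible alternative (a different technique from the masking shown here, sketched under the assumption that a is a one-dimensional array) is to group with np.unique and np.bincount. Whether it is actually faster on a given dataset would need to be measured.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(1, 30, size=1_000_000)
b = rng.random(size=1_000_000)

# vals: sorted distinct values of a
# inverse: for each element of a, the position of its value in vals
vals, inverse = np.unique(a, return_inverse=True)

# Per-group sums of b and per-group counts
sums = np.bincount(inverse, weights=b)
counts = np.bincount(inverse)
means = sums / counts  # every count is at least 1, so no division by zero

for val, mean in zip(vals, means):
    print("Average for values marked {0}: {1}".format(val, mean))
```

Here means[i] is the average of b over the positions where a equals vals[i].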