Is Higham's computation of the mean worth the price?
An answer to this question on the Scientific Computing Stack Exchange.
Question
In Accuracy and Stability of Numerical Algorithms, equation 1.6a, Higham gives the following update formula for the mean: $$ M_{1} := x_1, \quad M_{k+1} := M_{k} + \frac{x_{k+1} - M_k}{k+1} $$ One benefit of this update is that it doesn't overflow, which is a potential problem when we naively compute $\sum_{i=1}^{N} x_i$ and then divide by $N$.
But the naive sum requires $N-1$ adds and 1 division, and potentially vectorizes very nicely, whereas the update proposed by Higham requires $2N-2$ additions and $N-1$ divisions, and I'm unaware of any assembly instruction that can vectorize all this.
So is Higham's update formula worth using? Are there benefits I'm not seeing?
Note: Higham gives the following generic advice (Section 1.18, "Designing Stable Algorithms"):
"It is advantageous to express update formulae as
newvalue = oldvalue + smallcorrectionif the small correction can be computed with many correct significant figures."
Update 1.6a does take this form, but it's not clear to me that the small correction can be computed to many significant figures.
Edit: I found an empirical study of the performance of various methods of computing means, and 1.6a came highly recommended.
However, even after reading that paper it still wasn't clear to me that the update is worth the price; and in any case, I was hoping for worst- and average-case bounds on the accumulated rounding error.
Answer
Higham's algorithm seems very useful for online algorithms, or for settings with limited storage capacity such as edge processing. In those situations it is probably always worth the cost.
But, to somewhat address your question as posed, I implemented an SSE2 version which I think captures it:
#include <chrono>
#include <cstdlib>
#include <iomanip>
#include <iostream>
#include <xmmintrin.h>

int main(){
  const int num_count = 100000;
  alignas(16) float data[num_count];
  for(int i=0;i<num_count;i++)
    data[i] = rand()%100000+rand()/(double)RAND_MAX;

  {
    const auto t1 = std::chrono::high_resolution_clock::now();
    float sum = 0;
    for(int i=0;i<num_count;i++){
      sum += data[i];
    }
    const float mean = sum/num_count;
    const auto t2 = std::chrono::high_resolution_clock::now();
    const auto time_span1 = std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1);
    std::cout << "Exec time:\t" << time_span1.count() << " s\n";
    std::cout << "Simple Mean = " << std::setprecision(20) << mean << std::endl;
  }

  {
    const auto t1 = std::chrono::high_resolution_clock::now();
    //Seed each lane's running mean with the first four elements (M_1 per lane)
    __m128 mean = _mm_load_ps(&data[0]);
    for(int i=4;i<num_count;i+=4){
      const __m128 x    = _mm_load_ps(&data[i]);
      const __m128 diff = _mm_sub_ps(x, mean);
      //Each lane has already seen i/4 elements, so the incoming element is
      //the (i/4+1)-th and the update divides by i/4+1
      const __m128 k    = _mm_set1_ps(i/4 + 1);
      const __m128 div  = _mm_div_ps(diff, k);
      mean = _mm_add_ps(mean, div);
    }
    float result[4];
    _mm_store_ps(result, mean);
    const float tmean = (result[0] + result[1] + result[2] + result[3])/4; //I'm suspicious about this step: probably throws away precision
    const auto t2 = std::chrono::high_resolution_clock::now();
    const auto time_span1 = std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1);
    std::cout << "Exec time:\t" << time_span1.count() << " s\n";
    std::cout << "Higham Mean = " << std::setprecision(20) << tmean << std::endl;
  }
}
and observe that
Exec time: 0.000225851 s
Simple Mean = 49891.23046875
Exec time: 0.0003759360000000000002 s
Higham Mean = 49890.26171875
Higham's mean takes a little longer to calculate, and the two values differ by what might be a non-negligible amount; how much accuracy you need really depends on your application.