How can I optimize omp pragmas to run code between parallel regions?
An answer to this question on Stack Overflow.
Question
I have this C code that I need to optimize with OpenMP. I can't share the original code, but here is a surrogate:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#ifdef _OPENMP
#include <omp.h>
#endif

void Funct(double *vec, int len)
{
    int i;
    double tmp;

    // Section 1
    #pragma omp parallel for
    for (i = 0; i < len; i++)   // Simulates the initialization of vec done in the original code
        vec[i] = i;

    // Section 2
    // This code must be run sequentially
    tmp = vec[0];
    vec[0] = vec[len - 1];
    vec[len - 1] = tmp;
    tmp = vec[0];
    vec[0] = vec[len - 1];
    vec[len - 1] = tmp;
    // End of the sequential code

    // Section 3
    #pragma omp parallel for
    for (i = 0; i < len; i++)   // Simulates the workload on vec
    {
        vec[i] = pow(vec[i], 2);
        vec[i] = sqrt(vec[i]);
        vec[i] += 1;
        vec[i] = pow(vec[i], 2);
        vec[i] = sqrt(vec[i]);
        vec[i] -= 1;
    }
}

int main()
{
    double *vec;
    int i;

    vec = (double *) malloc(sizeof(double) * 5104);   // Length of the vector in the original code

    for (i = 0; i < 1000000; i++)   // Iteration count in the original code
        Funct(vec, 5104);

    for (i = 0; i < 5; i++)   // Access the array so -O2 cannot optimize the work away
        printf("%.2f ", vec[i * 1000]);

    free(vec);
    return 0;
}
In Funct, Sections 1, 2 and 3 must be executed in order; Section 2 itself is strictly sequential.
In the original code I'm forced to parallelize inside the function Funct(...), so, sadly, the cost of creating the threads is multiplied by the number of iterations. That is acceptable, since it still gives some speedup as the loop count in main or the length of vec grows (if you have suggestions here, I'm very open to them). The problem is Section 2: I think it makes OpenMP create a barrier or a wait, and that slows down the execution. If I remove that section I get a pretty acceptable speedup compared to the sequential code; sadly, I can't remove it. I've tried omp single, omp critical, and so on, hoping the code would be assigned to one of the threads of the previous pool, but nothing worked. Is there a way to make this more performant? (Drastically changing the pragmas is not a problem.)
(Compiled with gcc file.c -o file.out -lm -O2 -fopenmp; tested under Linux Lubuntu using time ./file.out.)
Edit 1: I'd like to point out that
tmp = vec[0];
vec[0] = vec[len - 1];
vec[len - 1] = tmp;
tmp = vec[0];
vec[0] = vec[len - 1];
vec[len - 1] = tmp;
is just placeholder code I've put inside the function to make clear that it must run sequentially (it performs the same operation twice, swapping vec[0] and vec[len - 1], so at the end of the execution nothing has really changed); I could have written any other function or code instead.
For instance, I could have written
Foo1();
Foo2();
Foo3();
Answer
Restrict your parallel loops to the interior elements:
for ( i = 1; i < len - 1; i++ )
and treat the first and last elements as special cases. Since Section 2 only touches vec[0] and vec[len - 1], those two elements can be handled outside of the OpenMP worksharing loops, which removes the need for a barrier between the sections.