C++ - How to hint OpenMP stride?


I am trying to understand the conceptual reason why OpenMP breaks loop vectorization. Any suggestions for fixing this would also be helpful. I am considering parallelizing manually to fix the issue, but that would not be elegant and would result in a massive amount of code bloat, since my code consists of several such sections that lend themselves to both vectorization and parallelization.

I am using

Microsoft (R) C/C++ Optimizing Compiler Version 17.00.60315.1 for x64

With OpenMP:

info C5002: loop not vectorized due to reason '502'

Without OpenMP:

info C5001: loop vectorized

The VS vectorization page says this error happens when:

Induction variable is stepped in some manner other than a simple +1

Can I force it to step in stride 1?

The loop:

    #pragma omp parallel for
    for (int j = 0; j < h*w; j++) // a, b, c, d, in are __restrict
    {
        float gs = d[j]-b[j];
        float gc = a[j]-c[j];
        in[j] = atan2f(gs, gc);
    }

Best effort(?):

    #pragma omp parallel
    { // this seems to vectorize, but still requires quite a lot of boilerplate code
        int middle = h*w/2;
        #pragma omp sections nowait
        {
            #pragma omp section
            for (int j = 0; j < middle; j++)
            {
                float gs = d[j]-b[j];
                float gc = a[j]-c[j];
                in[j] = atan2f(gs, gc);
            }
            #pragma omp section
            for (int j = middle; j < h*w; j++)
            {
                float gs = d[j]-b[j];
                float gc = a[j]-c[j];
                in[j] = atan2f(gs, gc);
            }
        }
    }

I recommend that you do the vectorization manually. One reason is that auto-vectorization does not seem to handle loop-carried dependencies well (loop unrolling).

To avoid code bloat and arcane intrinsics I use Agner Fog's vectorclass. In my experience it is just as fast as using intrinsics, and it automatically takes advantage of SSE2 through AVX2 (AVX2 was tested on an Intel emulator) depending on how you compile. I have written GEMM code using the vectorclass that works on SSE2 up to AVX2, and when I run it on a system with AVX my code is faster than Eigen, which only uses SSE. Here is your function with the vectorclass (I did not try unrolling the loop).

    #include "omp.h"
    #include "math.h"

    #include "vectorclass.h"
    #include "vectormath.h"

    void loop(const int h, const int w, const int outer_stride, float *a, float *b, float *c, float *d, float* in)
    {
        #pragma omp parallel for
        for (int j = 0; j < h*w; j+=8) // a, b, c, d, in are __restrict; w*h must be a multiple of 8
        {
            // Process 8 floats per iteration
            Vec8f gs = Vec8f().load(&d[j]) - Vec8f().load(&b[j]);
            Vec8f gc = Vec8f().load(&a[j]) - Vec8f().load(&c[j]);
            // Two-argument atan as written in the original answer; depending on
            // the vectorclass version this function may be spelled atan2
            Vec8f invec = atan(gs, gc);
            invec.store(&in[j]);
        }
    }

When doing the vectorization yourself you have to be careful with array bounds. In the function above h*w needs to be a multiple of 8. There are several solutions for this, but the easiest and most efficient one is to make your arrays (a, b, c, d, in) a bit larger (at most 7 floats larger) if necessary, so that their length is a multiple of 8. Another solution is to use the following code, which does not require w*h to be a multiple of 8, but it is not as pretty.
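The padding idea can be sketched as follows. This is a minimal illustration and not part of the original code: `round_up` and `make_padded` are hypothetical helper names, and `std::vector` is only an assumption about how the arrays are stored. Alignment is not addressed here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Round n up to the next multiple of s (s must be a power of two).
inline std::size_t round_up(std::size_t n, std::size_t s) {
    assert(s != 0 && (s & (s - 1)) == 0);
    return (n + s - 1) & ~(s - 1);
}

// Hypothetical helper: allocate a buffer for an h x w image whose length
// is padded with zeros up to a multiple of 8 floats (at most 7 extra).
inline std::vector<float> make_padded(int h, int w) {
    assert(h >= 0 && w >= 0);
    const std::size_t n = static_cast<std::size_t>(h) * static_cast<std::size_t>(w);
    return std::vector<float>(round_up(n, 8), 0.0f);
}
```

With buffers allocated this way, a loop that steps by 8 up to the padded length never reads or writes out of bounds; the extra elements of the output are simply ignored.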

    #define round_down(x, s) ((x) & ~((s)-1))

    void loop_fix(const int h, const int w, const int outer_stride, float *a, float *b, float *c, float *d, float* in)
    {
        // Main loop: full blocks of 8 floats
        #pragma omp parallel for
        for (int j = 0; j < round_down(h*w,8); j+=8) // a, b, c, d, in are __restrict
        {
            Vec8f gs = Vec8f().load(&d[j]) - Vec8f().load(&b[j]);
            Vec8f gc = Vec8f().load(&a[j]) - Vec8f().load(&c[j]);
            Vec8f invec = atan(gs, gc);
            invec.store(&in[j]);
        }
        // Scalar tail: the remaining (at most 7) elements
        for (int j = round_down(h*w,8); j < h*w; j++)
        {
            float gs = d[j]-b[j];
            float gc = a[j]-c[j];
            in[j] = atan2f(gs, gc);
        }
    }
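To check the index arithmetic of this main-loop-plus-tail pattern without the vectorclass, here is a scalar stand-in. It is only a sketch: `loop_blocks_and_tail` is a hypothetical name, and the inner loop over `k` merely takes the place of one `Vec8f` operation.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Largest multiple of s that does not exceed x (s must be a power of two).
inline int round_down(int x, int s) {
    assert(x >= 0 && s > 0 && (s & (s - 1)) == 0);
    return x & ~(s - 1);
}

// Scalar stand-in for loop_fix: full blocks of 8 first, then the tail.
inline void loop_blocks_and_tail(int n, const float* a, const float* b,
                                 const float* c, const float* d, float* in) {
    const int n8 = round_down(n, 8);
    for (int j = 0; j < n8; j += 8) {
        for (int k = 0; k < 8; ++k) {  // stands in for one 8-wide vector operation
            in[j + k] = std::atan2(d[j + k] - b[j + k], a[j + k] - c[j + k]);
        }
    }
    for (int j = n8; j < n; ++j) {     // at most 7 leftover elements
        in[j] = std::atan2(d[j] - b[j], a[j] - c[j]);
    }
}
```

For example, with n = 19 the block loop covers indices 0 to 15 and the tail covers 16 to 18, so every element is written exactly once.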

One challenge with doing the vectorization yourself is finding a SIMD math library (e.g. for atan2f). The vectorclass supports three options: non-SIMD, LIBM by AMD, and SVML by Intel (I used the non-SIMD option in the code above). See also: SIMD math libraries for SSE and AVX.

Some last comments you might want to consider. Visual Studio has auto-parallelization (off by default) as well as auto-vectorization (on by default, at least in release mode). You can try this instead of OpenMP to reduce code bloat. http://msdn.microsoft.com/en-us/library/hh872235.aspx
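A rough sketch of what the auto-parallelization route could look like. The function name `loop_autopar` is hypothetical, the pragmas are MSVC-specific and only take effect when compiling with /Qpar, and I have not verified that this particular loop ends up both parallelized and vectorized.

```cpp
#include <cassert>
#include <cmath>

// Plain unit-stride loop; parallelization is left to the compiler.
void loop_autopar(int n, const float* a, const float* b,
                  const float* c, const float* d, float* in) {
    assert(n >= 0);
#ifdef _MSC_VER
    // MSVC-only hints, honored when compiling with /Qpar.
    #pragma loop(hint_parallel(0))  // 0 = use the maximum number of threads
    #pragma loop(ivdep)             // hint: ignore assumed vector dependencies
#endif
    for (int j = 0; j < n; j++) {
        const float gs = d[j] - b[j];
        const float gc = a[j] - c[j];
        in[j] = std::atan2(gs, gc);
    }
}
```

Compiling with /Qpar-report:2 and /Qvec-report:2 makes the compiler report which loops it parallelized or vectorized, and the reason code when it did not.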

Additionally, Microsoft has the Parallel Patterns Library. It is worth looking into, since Microsoft's OpenMP support is limited, and it is nearly as easy to use as OpenMP. It is possible that one of these options works better with auto-vectorization (though I doubt it). Like I said, I would do the vectorization manually with the vectorclass.
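One more option, which is not from the code above and which I have not verified against MSVC's vectorizer: generalize the two-section workaround from the question by letting OpenMP iterate over chunks while a separate function runs a plain +1 loop over each chunk. The names `atan2_range` and `loop_chunked` and the chunk size of 4096 are assumptions for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical helper: a plain unit-stride loop over [begin, end) with no
// OpenMP inside, so the auto-vectorizer sees a simple induction variable.
static void atan2_range(int begin, int end, const float* a, const float* b,
                        const float* c, const float* d, float* in) {
    for (int j = begin; j < end; j++) {
        const float gs = d[j] - b[j];
        const float gc = a[j] - c[j];
        in[j] = std::atan2(gs, gc);
    }
}

// OpenMP distributes whole chunks across threads; the last chunk may be short.
void loop_chunked(int n, const float* a, const float* b, const float* c,
                  const float* d, float* in, int chunk = 4096) {
    assert(n >= 0 && chunk > 0);
    const int nchunks = (n + chunk - 1) / chunk;
    #pragma omp parallel for
    for (int k = 0; k < nchunks; k++) {
        const int begin = k * chunk;
        const int end = std::min(n, begin + chunk);
        atan2_range(begin, end, a, b, c, d, in);
    }
}
```

This keeps the loop body in one place instead of duplicating it per section. Whether the inner loop is actually vectorized still has to be checked in the compiler's vectorization report.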

