c++ - How to hint OpenMP stride?
I am trying to understand the conceptual reason why OpenMP breaks loop vectorization here; any suggestions for fixing it would be helpful. I am considering manually parallelizing to fix the issue, but that would not be elegant and would result in a massive amount of code bloat, since the code consists of several such sections that lend themselves to vectorization and parallelization.
I am using

    Microsoft (R) C/C++ Optimizing Compiler Version 17.00.60315.1 for x64

With OpenMP:

    info C5002: loop not vectorized due to reason '502'

Without OpenMP:

    info C5001: loop vectorized
The VS vectorization page says this error happens when the:

    induction variable is stepped in some manner other than a simple +1

Can I somehow force it to step in stride 1?
The loop:

    #pragma omp parallel for
    for (int j = 0; j < h*w; j++)  // a,b,c,d,in are __restrict
    {
        float gs = d[j]-b[j];
        float gc = a[j]-c[j];
        in[j] = atan2f(gs, gc);
    }
Best effort(?):

    #pragma omp parallel
    {
        // This seems to vectorize, but it still requires quite a lot of boilerplate code
        int middle = h*w/2;
        #pragma omp sections nowait
        {
            #pragma omp section
            for (int j = 0; j < middle; j++)
            {
                float gs = d[j]-b[j];
                float gc = a[j]-c[j];
                in[j] = atan2f(gs, gc);
            }
            #pragma omp section
            for (int j = middle; j < h*w; j++)
            {
                float gs = d[j]-b[j];
                float gc = a[j]-c[j];
                in[j] = atan2f(gs, gc);
            }
        }
    }
I recommend vectorizing manually. One reason is that auto-vectorization does not seem to handle loop-carried dependencies well (loop unrolling).
To avoid code bloat and arcane intrinsics, I use Agner Fog's vectorclass. In my experience it's as fast as using intrinsics, and it automatically takes advantage of SSE2 through AVX2 (AVX2 tested on an Intel emulator) depending on how you compile. I have written GEMM code using vectorclass that works from SSE2 to AVX2, and when run on a system with AVX my code is faster than Eigen, which only uses SSE. Here is your function with vectorclass (I did not try unrolling the loop).
    #include "omp.h"
    #include "math.h"
    #include "vectorclass.h"
    #include "vectormath.h"

    void loop(const int h, const int w, const int outer_stride,
              float *a, float *b, float *c, float *d, float *in) {
        #pragma omp parallel for
        for (int j = 0; j < h*w; j += 8)  // a,b,c,d,in are __restrict; w*h must be a multiple of 8
        {
            Vec8f gs = Vec8f().load(&d[j]) - Vec8f().load(&b[j]);
            Vec8f gc = Vec8f().load(&a[j]) - Vec8f().load(&c[j]);
            Vec8f invec = atan2(gs, gc);
            invec.store(&in[j]);
        }
    }
When doing vectorization you have to be careful with array bounds. In the function above, h*w needs to be a multiple of 8. There are several solutions, but the easiest and most efficient is to make the arrays (a, b, c, d, in) a bit larger (at most 7 floats larger) if necessary, so their size is a multiple of 8. However, another solution is to use the following code, which does not require w*h to be a multiple of 8, though it's not as pretty.
    #define ROUND_DOWN(x, s) ((x) & ~((s)-1))

    void loop_fix(const int h, const int w, const int outer_stride,
                  float *a, float *b, float *c, float *d, float *in) {
        #pragma omp parallel for
        for (int j = 0; j < ROUND_DOWN(h*w, 8); j += 8)  // a,b,c,d,in are __restrict
        {
            Vec8f gs = Vec8f().load(&d[j]) - Vec8f().load(&b[j]);
            Vec8f gc = Vec8f().load(&a[j]) - Vec8f().load(&c[j]);
            Vec8f invec = atan2(gs, gc);
            invec.store(&in[j]);
        }
        // Scalar cleanup loop for the remaining (h*w mod 8) elements
        for (int j = ROUND_DOWN(h*w, 8); j < h*w; j++) {
            float gs = d[j]-b[j];
            float gc = a[j]-c[j];
            in[j] = atan2f(gs, gc);
        }
    }
One challenge when doing vectorization is finding a SIMD math library (e.g. for atan2f). vectorclass supports three options: non-SIMD, LIBM from AMD, and SVML from Intel (I used the non-SIMD option in the code above). See also: SIMD math libraries for SSE and AVX.
Some last comments you might want to consider. Visual Studio has auto-parallelization (off by default) as well as auto-vectorization (on by default, at least in Release mode). You can try these instead of OpenMP to reduce code bloat. http://msdn.microsoft.com/en-us/library/hh872235.aspx
Additionally, Microsoft has the Parallel Patterns Library. It's worth looking into, since Microsoft's OpenMP support is limited, and it's nearly as easy to use as OpenMP. It's possible that one of these options works better with auto-vectorization (though I doubt it). That said, I would vectorize manually with vectorclass.