I want to use streaming stores in my code on Intel MIC. I have an array force_array and three variables tempx, tempy and tempz. I need to do some computation on them and store the results in another array, f, which won't be used again in the near future, so I thought streaming (non-temporal) stores would be a good way to improve performance. However, I am getting a segmentation fault, and I am not sure whether it comes from the loads or from the stores. The snippet below is preceded and followed by a few more lines of code, and the whole thing sits inside two nested for loops that are preceded by OpenMP directives. Because it is a parallel program, I am not able to debug it well. Can anyone help me find the mistake(s)?
Thanks in advance! The code is given below:
for (k = 0; k < np; k++) // np is the number of particles.
{
    for (j = k + 1; j < np; j++)
    {
        __m512d y1, y2, y3, y4, y5, y6;
        y1 = _mm512_load_pd(force_array + k*nd + 0);
        y4 = _mm512_load_pd(&tempx);
        y1 = _mm512_sub_pd(y1, y4);
        y2 = _mm512_load_pd(force_array + k*nd + 1);
        y5 = _mm512_load_pd(&tempy);
        y2 = _mm512_sub_pd(y2, y5);
        y3 = _mm512_load_pd(force_array + k*nd + 2);
        y6 = _mm512_load_pd(&tempz);
        y3 = _mm512_sub_pd(y3, y6);
        _mm512_storenr_pd((f + k*nd + 0), y1);
        _mm512_storenr_pd((f + k*nd + 1), y2);
        _mm512_storenr_pd((f + k*nd + 2), y3);
    }
}
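
To narrow down whether the loads or the stores fault, here is a minimal sketch of a check I could run on the addresses before calling the intrinsics. As far as I understand, _mm512_load_pd and _mm512_storenr_pd each move 8 doubles (64 bytes) and expect a 64-byte-aligned address. The helper names is_aligned64 and count_misaligned_rows are hypothetical (they are not part of my code or of any library), and the sketch is plain C, so it can be tried on the host as well.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if p lies on a 64-byte boundary, else 0. */
static int is_aligned64(const void *p)
{
    return ((uintptr_t)p % 64u) == 0;
}

/* Hypothetical helper: counts how many row starts base + k*nd,
 * for k in [0, np), are NOT on a 64-byte boundary. */
static size_t count_misaligned_rows(const double *base, size_t np, size_t nd)
{
    size_t bad = 0;
    for (size_t k = 0; k < np; k++) {
        if (!is_aligned64(base + k * nd)) {
            bad++;
        }
    }
    return bad;
}
```

For example, count_misaligned_rows(force_array, np, nd) would be nonzero whenever np > 1 and nd is not a multiple of 8, and is_aligned64(&tempx) would show whether the scalar even sits on a 64-byte boundary. The same checks could be applied to f for the store side.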