Try this code. It uses small vectors to multiply (and divide) the values you want vectorized:
static void not_suitable_for_gather(unsigned char * __restrict__ pixels, int * __restrict__ indices, int indices_num)
{
    // Per-channel divisors and multipliers (R, G, B, A); choose whatever suits.
    float dividerV[4] = {1.0f, 1.0f, 1.0f, 255.0f};
    float multiplierV[4] = {1.0f, 1.0f, 1.0f, 255.0f};
    // The same vector can be used to both multiply and divide. Separate vectors
    // may allow more pipelining, but also need more memory accesses, so choose carefully.
    for (int i = 0; i < indices_num; ++i)
    {
        int idx = indices[i] * 4; // offset of the selected RGBA pixel
        float r = pixels[idx + 0] / dividerV[0];
        float g = pixels[idx + 1] / dividerV[1];
        float b = pixels[idx + 2] / dividerV[2];
        float a = pixels[idx + 3] / dividerV[3];
        // Results are converted back to unsigned char on assignment.
        pixels[idx + 0] = r * multiplierV[0];
        pixels[idx + 1] = g * multiplierV[1];
        pixels[idx + 2] = b * multiplierV[2];
        pixels[idx + 3] = a * multiplierV[3];
    }
}
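This may be easier for the compiler to vectorize.
To deal with the unknown loop bound, try giving a literal constant instead of indices_num. The compiler here is not a just-in-time compiler (maybe one exists, but I have not heard of any outside Java), so giving it a constant that is known at compile time may help.
Here: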
static void not_suitable_for_gather(unsigned char * __restrict__ pixels, int * __restrict__ indices)
{
    // Per-channel divisors and multipliers (R, G, B, A); choose whatever suits.
    float dividerV[4] = {1.0f, 1.0f, 1.0f, 255.0f};
    float multiplierV[4] = {1.0f, 1.0f, 1.0f, 255.0f};
    // The same vector can be used to both multiply and divide. Separate vectors
    // may allow more pipelining, but also need more memory accesses, so choose carefully.
    for (int i = 0; i < 1000; ++i) // trip count is now a compile-time constant
    {
        int idx = indices[i] * 4;
        float r = pixels[idx + 0] / dividerV[0];
        float g = pixels[idx + 1] / dividerV[1];
        float b = pixels[idx + 2] / dividerV[2];
        float a = pixels[idx + 3] / dividerV[3];
        pixels[idx + 0] = r * multiplierV[0];
        pixels[idx + 1] = g * multiplierV[1];
        pixels[idx + 2] = b * multiplierV[2];
        pixels[idx + 3] = a * multiplierV[3];
    }
}
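If hard-coding 1000 is too rigid, one way to keep the trip count known at compile time is to make it a template parameter. This is only a minimal sketch: the name scale_pixels_fixed and the parameter Count are hypothetical and not part of the original code, and whether it helps vectorization at all depends on the compiler and its flags.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the original code): same per-pixel work as
// above, but the trip count is a template parameter, so it is a compile-time
// constant inside each instantiation.
template <std::size_t Count>
void scale_pixels_fixed(unsigned char* __restrict__ pixels,
                        const int* __restrict__ indices)
{
    static_assert(Count > 0, "Count must be positive");
    assert(pixels != nullptr && indices != nullptr);

    const float dividerV[4]    = {1.0f, 1.0f, 1.0f, 255.0f};
    const float multiplierV[4] = {1.0f, 1.0f, 1.0f, 255.0f};

    for (std::size_t i = 0; i < Count; ++i)
    {
        const int idx = indices[i] * 4; // offset of the selected RGBA pixel
        for (int c = 0; c < 4; ++c)
        {
            const float v = pixels[idx + c] / dividerV[c];
            pixels[idx + c] = static_cast<unsigned char>(v * multiplierV[c]);
        }
    }
}
```

A call site would then look like scale_pixels_fixed<1000>(pixels, indices), with one instantiation per distinct count.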
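Sometimes the array is not properly aligned for vector instructions. For example, the CPU may only speed up reads and writes on 32-byte (or 16-byte) aligned arrays; unaligned reads and writes are slower (or may not be vectorized at all).
Here: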
#include <cstdint> // std::uintptr_t
#include <cstdio>  // std::printf

static void not_suitable_for_gather(unsigned char * __restrict__ pixels, int * __restrict__ indices)
{
    float dividerV[4] = {1.0f, 1.0f, 1.0f, 255.0f};
    float multiplierV[4] = {1.0f, 1.0f, 1.0f, 255.0f}; // choose whatever suits
    // Check whether the start of the pixel buffer lies on a 32-byte boundary.
    if (reinterpret_cast<std::uintptr_t>(pixels) % 32 != 0)
    {
        std::printf("array is not aligned! need to shift array or do serial calc. until aligned offset is reached!\n");
        // Do the non-vectorized calculation here. Once an aligned offset is
        // reached, switch to the vectorizable code.
    }
    else
    {
        std::printf("array is aligned! Starting fast access.\n");
        for (int i = 0; i < 1000; ++i)
        {
            int idx = indices[i] * 4;
            float r = pixels[idx + 0] / dividerV[0];
            float g = pixels[idx + 1] / dividerV[1];
            float b = pixels[idx + 2] / dividerV[2];
            float a = pixels[idx + 3] / dividerV[3];
            pixels[idx + 0] = r * multiplierV[0];
            pixels[idx + 1] = g * multiplierV[1];
            pixels[idx + 2] = b * multiplierV[2];
            pixels[idx + 3] = a * multiplierV[3];
        }
    }
}
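The alignment check and the "serial until aligned" step can be factored into small helpers. The sketch below is an illustration under assumptions: is_aligned, bytes_until_aligned, and AlignedPixels are hypothetical names, and 32 bytes is just the example boundary used above. Note that an aligned buffer start does not by itself make every indexed pixel access aligned.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true when ptr is a multiple of `alignment` bytes.
// `alignment` must be a power of two.
inline bool is_aligned(const void* ptr, std::size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(ptr) & (alignment - 1)) == 0;
}

// Hypothetical helper: number of leading bytes to handle serially before the
// first `alignment`-aligned address is reached (0 if already aligned).
inline std::size_t bytes_until_aligned(const void* ptr, std::size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    const std::uintptr_t rem =
        reinterpret_cast<std::uintptr_t>(ptr) & (alignment - 1);
    return rem == 0 ? 0 : alignment - static_cast<std::size_t>(rem);
}

// Example buffer whose first byte is guaranteed to sit on a 32-byte boundary.
struct AlignedPixels
{
    alignas(32) unsigned char data[4000];
};
```

Requesting the alignment up front with alignas avoids the unaligned branch entirely for buffers you allocate yourself; the helpers are only needed for pointers that come from elsewhere.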
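Maybe someone could take the memcpy source, or some other array-copy asm file, inject some multiplication code into it, and build it as memcpy_with_multiplication(,,,)?
My last suggestion: pack r, g, b, a into an array so that they sit at contiguous addresses. Here: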
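Short of editing assembly, the same idea can be sketched in plain C++ as a contiguous copy loop that scales each element on the way. The function copy_scaled and its signature are hypothetical, chosen only for illustration; it is not a real library routine and it does not define what the arguments of memcpy_with_multiplication would be.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the idea above: copy `count` floats from src to
// dst, multiplying each one by `factor`. The buffers must not overlap.
// Contiguous, stride-1 accesses like these are a pattern that
// auto-vectorizers tend to handle well.
inline void copy_scaled(float* __restrict__ dst,
                        const float* __restrict__ src,
                        std::size_t count,
                        float factor)
{
    assert(dst != nullptr && src != nullptr);
    for (std::size_t i = 0; i < count; ++i)
    {
        dst[i] = src[i] * factor;
    }
}
```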
static void not_suitable_for_gather(unsigned char * __restrict__ pixels, int * __restrict__ indices)
{
    float dividerV[4] = {1.0f, 1.0f, 1.0f, 255.0f};
    float multiplierV[4] = {1.0f, 1.0f, 1.0f, 255.0f}; // choose whatever suits
    // The same vector can be used to both multiply and divide. Separate vectors
    // may allow more pipelining, but also need more memory accesses, so choose carefully.
    for (int i = 0; i < 1000; ++i)
    {
        int idx = indices[i] * 4;
        float rgba[4]; // the four channels now sit at contiguous addresses
        rgba[0] = pixels[idx + 0] / dividerV[0];
        rgba[1] = pixels[idx + 1] / dividerV[1];
        rgba[2] = pixels[idx + 2] / dividerV[2];
        rgba[3] = pixels[idx + 3] / dividerV[3];
        pixels[idx + 0] = rgba[0] * multiplierV[0];
        pixels[idx + 1] = rgba[1] * multiplierV[1];
        pixels[idx + 2] = rgba[2] * multiplierV[2];
        pixels[idx + 3] = rgba[3] * multiplierV[3];
    }
}
"indices[i]" is not a plain index expression, and that may be the problem. Try expressing it to the compiler some other way. What happens when you use just i instead of indices[i]? Does it compile to the same code? The value of indices[i] cannot be known at compile time, or the access pattern is too complex for the compiler.
Simpler (and also wrong), but more vectorizable:
static void not_suitable_for_gather(unsigned char * __restrict__ pixels, int * __restrict__ indices)
{
    float dividerV[4] = {1.0f, 1.0f, 1.0f, 255.0f};
    float multiplierV[4] = {1.0f, 1.0f, 1.0f, 255.0f}; // choose whatever suits
    // You need a sorted version of the indices[] (or pixels[]) array to
    // achieve something like this; indices is deliberately unused here.
    for (int i = 0; i < 4000; i += 4)
    {
        float rgba[4];
        rgba[0] = pixels[i + 0] / dividerV[0];
        rgba[1] = pixels[i + 1] / dividerV[1];
        rgba[2] = pixels[i + 2] / dividerV[2];
        rgba[3] = pixels[i + 3] / dividerV[3];
        pixels[i + 0] = rgba[0] * multiplierV[0];
        pixels[i + 1] = rgba[1] * multiplierV[1];
        pixels[i + 2] = rgba[2] * multiplierV[2];
        pixels[i + 3] = rgba[3] * multiplierV[3];
    }
}
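One way to get a contiguous loop like this without changing which pixels are touched is to gather the indexed pixels into a temporary buffer, process that buffer linearly, and scatter the results back. The sketch below is an assumption-laden illustration: process_via_gather_scatter is a hypothetical name, the std::vector temporary is my choice, and it assumes indices contains no duplicates. The extra copies may cost more than the vectorized loop saves, so measure before adopting it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original code). Assumes every entry of
// `indices` is a valid pixel index and that no index appears twice.
inline void process_via_gather_scatter(unsigned char* pixels,
                                       const int* indices,
                                       std::size_t count)
{
    assert(pixels != nullptr && indices != nullptr);
    const float dividerV[4]    = {1.0f, 1.0f, 1.0f, 255.0f};
    const float multiplierV[4] = {1.0f, 1.0f, 1.0f, 255.0f};

    // Gather: copy each selected RGBA pixel into a contiguous float buffer.
    std::vector<float> tmp(count * 4);
    for (std::size_t i = 0; i < count; ++i)
        for (int c = 0; c < 4; ++c)
            tmp[i * 4 + c] = pixels[indices[i] * 4 + c];

    // Process: stride-1 loop with no indirection, the vectorizable part.
    for (std::size_t i = 0; i < count * 4; i += 4)
        for (int c = 0; c < 4; ++c)
            tmp[i + c] = (tmp[i + c] / dividerV[c]) * multiplierV[c];

    // Scatter: write the results back to their original positions.
    for (std::size_t i = 0; i < count; ++i)
        for (int c = 0; c < 4; ++c)
            pixels[indices[i] * 4 + c] =
                static_cast<unsigned char>(tmp[i * 4 + c]);
}
```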