If I have a kernel that looks back over the last X minutes and computes the average of all the values in a float[], will I take a performance hit if the threads are not all executing the same line of code at the same time?

For example: say that at x = 1500 there are 500 data points in the last 2 hours, while at x = 1510 there are only 300 data points in the last 2 hours.

The thread at x = 1500 has to look back over 500 positions, while the thread at x = 1510 only looks back over 300, so the later thread will move on to its next position before the first thread finishes.

Is this typically a problem?
Edit: example code below. Sorry that it is in C#, but I plan to use CUDAfy.net. Hopefully it gives a rough idea of the kind of program structure I need to run (the actual code is more complex, but similar in structure). Any comments on whether this is suitable for a GPU/coprocessor, or is only workable on a CPU, would be appreciated.
public void PopulateMeanArray(float[] data)
{
    float lookFwdDistance = 108000000000f; // how far forward in time this thread walks
    float lookBkDistance = 12000000000f;   // trailing time window to average over
    int counter = thread.blockIdx.x * 1000; // unique output region per block (assumes < 1000 entries each)
    float numberOfTicksInLookBack = 0;
    float sum = 0; // sum of the time gaps between consecutive ticks inside the look-back window
    // Note: the time between ticks is not constant, so numberOfTicksInLookBack differs at each position.

    //Thread 1 could be working here.
    float startTime = SDS.tick[thread.blockIdx.x];
    for (int tickPosition = thread.blockIdx.x; SDS.tick[tickPosition] < startTime + lookFwdDistance; tickPosition++)
    {
        sum = 0;
        numberOfTicksInLookBack = 0;
        //Thread 2 could be working here. Is this warp divergence?
        for (int pastPosition = tickPosition - 1; SDS.tick[pastPosition] > SDS.tick[tickPosition] - lookBkDistance; pastPosition--)
        {
            sum += SDS.tick[pastPosition + 1] - SDS.tick[pastPosition]; // gap between consecutive ticks
            numberOfTicksInLookBack++;
        }
        data[counter] = sum / numberOfTicksInLookBack;
        counter++;
    }
}