cuda - For Loop in Device Function on Compute Capability 1.1 Device

Question

I wrote a __device__ function that uses a for loop. It works on a GTX460 card (compute capability 2.1), but not on a 9500GT (compute capability 1.1).

The function looks roughly like this:

__device__ void myFuncD(float4 *myArray, float4 *result, uint index, uint foo, uint *here, uint *there)
{
    uint j;
    float4 myValue = myArray[index];
    uint idxHere = here[foo];
    uint idxThere = there[foo];
    float4 temp;

    for(j=idxHere;j<idxThere;j++){
        temp = myArray[j];

        //do things with myValue and temp, write result to *result
        result->x += /* some calculations with myValue.x and temp.x */
        result->y += /* some calculations with myValue.y and temp.y */
        result->z += /* some calculations with myValue.z and temp.z */
    }
}

__global__ void myKernelD(float4 *myArray, float4 *myResults, uint *here, uint *there)
{
    uint index = blockDim.x*blockIdx.x+threadIdx.x;

    float4 result = make_float4(0.0f,0.0f,0.0f,0.0f);
    uint foo1, foo2, foo3, foo4;

    //compute foo1, foo2, foo3, foo4 based on myArray[index]

    myFuncD(myArray, &result, index, foo1, here, there);
    myFuncD(myArray, &result, index, foo2, here, there);
    myFuncD(myArray, &result, index, foo3, here, there);
    myFuncD(myArray, &result, index, foo4, here, there);

    myResults[index] = result;
}

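On the GTX460, myResults holds the expected values, but on the 9500GT every component of each of its members is zero.

How can I achieve the same result on a compute capability 1.1 device?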

score 1 · Accepted Answer

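The user was trying to launch the kernel with too many threads per block and was getting the error "too many resources requested for launch". Reducing the number of threads per block allowed the kernel to launch. Since the kernel never actually ran on the 9500GT, myResults was never written, which would explain the all-zero output.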
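A failed launch like this is silent unless the host code checks for it. The following is a minimal sketch of how the launch could be checked and the block size clamped to what the kernel can actually get on the current device. It makes several assumptions: the kernel body is a placeholder rather than the code from the question, the helper names launchChecked, roundDownToWarp and blocksNeeded are hypothetical, a warp size of 32 is hard-coded, and cudaDeviceSynchronize comes from newer toolkits (older ones used cudaThreadSynchronize).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for myKernelD from the question;
// the body is illustrative only.
__global__ void myKernelD(float4 *myArray, float4 *myResults, unsigned int n)
{
    unsigned int index = blockDim.x * blockIdx.x + threadIdx.x;
    if (index < n) {
        myResults[index] = myArray[index];
    }
}

// Round a thread count down to a whole number of warps.
// Counts below one warp are returned unchanged.
static int roundDownToWarp(int threads, int warpSize)
{
    if (threads < warpSize) return threads;
    return (threads / warpSize) * warpSize;
}

// Number of blocks needed to cover n elements.
static int blocksNeeded(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Hypothetical helper: clamp the requested block size to the limit the
// runtime reports for this kernel, launch, and return any error.
static cudaError_t launchChecked(float4 *dArray, float4 *dResults,
                                 int n, int requestedThreads)
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, myKernelD);
    if (err != cudaSuccess) return err;

    printf("registers per thread: %d, max threads per block: %d\n",
           attr.numRegs, attr.maxThreadsPerBlock);

    int threads = requestedThreads;
    if (threads > attr.maxThreadsPerBlock) threads = attr.maxThreadsPerBlock;
    threads = roundDownToWarp(threads, 32);  // assumed warp size
    if (threads <= 0) return cudaErrorInvalidValue;

    myKernelD<<<blocksNeeded(n, threads), threads>>>(dArray, dResults,
                                                     (unsigned int)n);

    err = cudaGetLastError();        // reports launch-time failures
    if (err != cudaSuccess) return err;
    return cudaDeviceSynchronize();  // reports failures during execution
}
```

A caller would pass device pointers and a desired block size, for example launchChecked(dArray, dResults, n, 512), and print cudaGetErrorString(err) whenever the return value is not cudaSuccess. Compiling with nvcc --ptxas-options=-v also prints the per-thread register count, which helps explain why a block size that fits on a newer card may not fit on a compute capability 1.1 device.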
