cuda - From Thrust to ArrayFire - gfor usage?

Question

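I am trying to replace some Thrust calls with ArrayFire in order to compare performance.

I am not sure I am using ArrayFire correctly, because the results I get do not match at all.

For example, the Thrust code I am using is: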

cudaMalloc( (void**) &devRow, N * sizeof(float) );
...//devRow is filled

thrust::device_ptr<float> SlBegin( devRow );
for ( int i = 0; i < N; i++, SlBegin += PerSlElmts )
{
    thrust::inclusive_scan( SlBegin, SlBegin + PerSlElmts, SlBegin );
}

cudaMemcpy( theRow, devRow, N * sizeof(float), cudaMemcpyDeviceToHost );
//use theRow...

ArrayFire:

af::array SlBegin( N , devRow );
for ( int i = 0;i < N; i++,SlBegin += PerSlElmts )
{
    accum( SlBegin );
}

cudaMemcpy( theRow, devRow, N * sizeof(float), cudaMemcpyDeviceToHost );
//use theRow..

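I am not sure how ArrayFire handles the copy in af::array SlBegin( N , devRow );. In Thrust, SlBegin is a device pointer that points at devRow, but what happens in ArrayFire?

I also wanted to ask about using gfor. The ArrayFire web page states:

Do not use this function directly; see GFOR: Parallel For-Loops.

and then, for GFOR:

GFOR is disabled in the current version of ArrayFire

So, can we not use gfor?

----- UPDATE -----

I have a small running example that shows the different results: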

#include <stdio.h>
#include <stdlib.h>

#include <cuda.h>
#include <cuda_runtime.h>
#include <curand_kernel.h>

#include "arrayfire.h"

#include <thrust/scan.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>

__global__ void Kernel( const int N ,float * const devRow )
{

   int i = threadIdx.x;
   if ( i < N )
        devRow[ i ] = i;

 }

int main(){

    int N = 6;
    int Slices = 2;
    int PerSlElmts = 3;

    float * theRow = (float*) malloc ( N * sizeof( float ));

    for ( int i = 0; i < N; i ++ )
        theRow[ i ] = 0;

    // raw pointer to device memory
    float * devRow;
    cudaMalloc( (void **) &devRow, N * sizeof( float ) );

    Kernel<<< 1,N >>>( N , devRow );
    cudaDeviceSynchronize();

    // wrap raw pointer with a device_ptr
    thrust::device_ptr<float> SlBegin( devRow );

    for ( int i = 0; i < Slices; i++ , SlBegin += PerSlElmts )
        thrust::inclusive_scan( SlBegin, SlBegin + PerSlElmts , SlBegin );

    cudaMemcpy( theRow, devRow, N * sizeof(float), cudaMemcpyDeviceToHost );

    for ( int i = 0; i < N; i++ )
        printf("\n Thrust accum : %f",theRow[ i ] );


    //--------------------------------------------------------------------//
    Kernel<<< 1,N >>>( N , devRow );
    cudaDeviceSynchronize();

    af::array SlBeginFire( N, devRow );

    for ( int i = 0; i < Slices; i++ , SlBeginFire += PerSlElmts )
        af::accum( SlBeginFire );

    SlBeginFire.host( theRow );

    for ( int i = 0; i < N; i++ )
            printf("\n Arrayfire accum : %f",theRow[ i ] );

    cudaFree( devRow );
    free( theRow );


    return 0;

}

score 2 · Accepted Answer

It looks like you are trying to run a column-wise scan (the 0th dimension in ArrayFire) on a 2D array. Here is some code you could use:

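// devRow is device memory, so pass afDevice (the default source is afHost).
af::array SlBegin(PerSlElmts, Slices, devRow, afDevice);  // one slice per column
af::array result = af::accum(SlBegin, 0);                 // scan down each column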

Here is a sample output:

A [5 3 1 1]
0.7402     0.4464     0.7762 
0.9210     0.6673     0.2948 
0.0390     0.1099     0.7140 
0.9690     0.4702     0.3585 
0.9251     0.5132     0.6814 

accum(A, 0) [5 3 1 1]
0.7402     0.4464     0.7762 
1.6612     1.1137     1.0709 
1.7002     1.2236     1.7850 
2.6692     1.6938     2.1435 
3.5943     2.2070     2.8249

This runs an inclusive scan on each column independently.
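To check either GPU version against known values, a plain CPU reference can help. The following is a minimal sketch, not part of Thrust or ArrayFire: scanPerSlice is a hypothetical helper name, and it assumes each slice is a contiguous run of perSlice elements, which matches a column of a column-major perSlice x slices matrix.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper (not a Thrust or ArrayFire function): runs an inclusive
// scan over each contiguous slice of `perSlice` elements. With column-major
// storage this equals scanning each column of a perSlice x slices matrix.
// Trailing elements that do not fill a whole slice are left unchanged.
std::vector<float> scanPerSlice(std::vector<float> data, std::size_t perSlice)
{
    if (perSlice == 0) return data;  // nothing sensible to scan

    for (std::size_t start = 0; start + perSlice <= data.size(); start += perSlice) {
        auto first = data.begin() + static_cast<std::ptrdiff_t>(start);
        auto last  = first + static_cast<std::ptrdiff_t>(perSlice);
        std::partial_sum(first, last, first);  // in-place inclusive scan of one slice
    }
    return data;
}
```

For the six-element example in the question (values 0 to 5, two slices of three elements), scanPerSlice({0, 1, 2, 3, 4, 5}, 3) returns 0, 1, 3, 3, 7, 12, so both the Thrust loop and the accum call can be compared against those values.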

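As for gfor, it has been added to the open-source version of ArrayFire. Because that code base is still in beta, improvements and fixes are landing very quickly, so keep an eye on our GitHub page.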
