c++ - Normalizing a bunch of vectors using Nvidia's Thrust library

Question

I just learned about Nvidia's Thrust library. To try it out, I wrote a small example that is supposed to normalize a bunch of vectors.

#include <cstdio>

#include <thrust/transform.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

struct normalize_functor: public thrust::unary_function<double4, double4>
{
    __device__ __host__ double4 operator()(double4 v)
    {
        double len = sqrt(v.x*v.x + v.y*v.y + v.z*v.z);
        v.x /= len;
        v.y /= len;
        v.z /= len;
        printf("%f %f %f\n", v.x, v.y, v.z);
    }
};

int main()
{
    thrust::host_vector<double4> v(2);
    v[0].x = 1; v[0].y = 2; v[0].z = 3;
    v[1].x = 4; v[1].y = 5; v[1].z = 6;

    thrust::device_vector<double4> v_d = v; 
    thrust::for_each(v_d.begin(), v_d.end(), normalize_functor());

    // This doesn't seem to copy back
    v = v_d;

    // Neither this does..
    thrust::host_vector<double4> result = v_d;

    for(int i=0; i<v.size(); i++)
        printf("[ %f %f %f ]\n", result[i].x, result[i].y, result[i].z);

    return 0;
}

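The example above seems to work, but I am unable to copy the data back. I thought a simple assignment would invoke a cudaMemcpy. It works for copying data from the host to the device, but not back?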

Secondly, I am not sure I am doing this the right way. The documentation of for_each says:

for_each applies the function object f to each element in the range [first, last); f's return value, if any, is ignored.

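But the unary_function struct template expects two template arguments (one for the return value) and forces operator() to return a value as well, which results in a warning when compiling. I don't see how I am supposed to write a unary functor with no return value.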

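Next is the data arrangement. I chose double4 simply because it should result in two fetch instructions, ld.v2.f64 and ld.f64, IIRC. However, I am wondering how Thrust fetches the data under the hood (and how many CUDA threads/blocks are created). If I chose a struct of 4 vectors instead, would it be able to fetch the data in a coalesced way?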

Finally, Thrust provides tuples. What about an array of tuples? How would the data be arranged in that case?

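I looked through the examples, but I did not find one that explains which data structure to choose for a bunch of vectors (the dot_products_with_zip.cu example talks about "structure of arrays" instead of "array of structures", but I see no structures being used in that example).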
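
For readers unfamiliar with the two terms, here is a minimal host-only sketch of the layouts being contrasted. It uses plain C++ rather than Thrust, and the names Vec3, Vec3Arrays, normalize_aos and normalize_soa are hypothetical. It says nothing about how Thrust stores tuples internally; it only shows that an "array of structures" keeps the components of one vector next to each other, while a "structure of arrays" keeps each component in its own contiguous array.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Array of structures: one element holds all components of one vector.
struct Vec3 {            // hypothetical type, stands in for double4 here
    double x, y, z;
};

// Structure of arrays: one contiguous array per component.
struct Vec3Arrays {      // hypothetical type
    std::vector<double> x, y, z;
};

// Normalize every vector of an array-of-structures container in place.
void normalize_aos(std::vector<Vec3>& v) {
    for (Vec3& e : v) {
        const double len = std::sqrt(e.x * e.x + e.y * e.y + e.z * e.z);
        e.x /= len; e.y /= len; e.z /= len;
    }
}

// Normalize every vector of a structure-of-arrays container in place.
void normalize_soa(Vec3Arrays& v) {
    assert(v.x.size() == v.y.size() && v.y.size() == v.z.size());
    for (std::size_t i = 0; i < v.x.size(); ++i) {
        const double len =
            std::sqrt(v.x[i] * v.x[i] + v.y[i] * v.y[i] + v.z[i] * v.z[i]);
        v.x[i] /= len; v.y[i] /= len; v.z[i] /= len;
    }
}
```

In the second layout, element i and element i+1 of the same component sit at adjacent addresses, which is the access pattern that neighbouring GPU threads would need for coalesced loads.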

Update

I fixed the code above and tried to run a bigger example, this time normalizing 10k vectors.

#include <cstdio>

#include <thrust/transform.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

struct normalize_functor
{
    __device__ __host__ void operator()(double4& v)
    {
        double len = sqrt(v.x*v.x + v.y*v.y + v.z*v.z);
        v.x /= len;
        v.y /= len;
        v.z /= len;
    }
};

int main()
{
    int n = 10000;
    thrust::host_vector<double4> v(n);
    for(int i=0; i<n; i++) {
        v[i].x = rand();
        v[i].y = rand();
        v[i].z = rand();
    }

    thrust::device_vector<double4> v_d = v;

    thrust::for_each(v_d.begin(), v_d.end(), normalize_functor());

    v = v_d;

    return 0;
}

Profiling with computeprof shows low occupancy and non-coalesced memory access:

Kernel Occupancy Analysis

Kernel details : Grid size: 23 x 1 x 1, Block size: 448 x 1 x 1
Register Ratio      = 0.984375  ( 32256 / 32768 ) [24 registers per thread] 
Shared Memory Ratio     = 0 ( 0 / 49152 ) [0 bytes per Block] 
Active Blocks per SM        = 3 / 8
Active threads per SM       = 1344 / 1536
Potential Occupancy     = 0.875  ( 42 / 48 )
Max achieved occupancy  = 0.583333  (on 9 SMs)
Min achieved occupancy  = 0.291667  (on 5 SMs)
Occupancy limiting factor   = Block-Size
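
As a quick sanity check, the occupancy figures above can be reproduced from one another with a few lines of arithmetic. This is only a sketch: the constants are copied from the profiler output, the warp size of 32 is the standard value for NVIDIA GPUs, and the function names are hypothetical. The last helper assumes one thread per element, which is consistent with the reported grid size of 23 for 10000 elements but does not prove that Thrust chooses its grid that way.

```cpp
#include <cassert>

// Values copied from the profiler output above.
constexpr int kBlockSize       = 448;   // threads per block
constexpr int kActiveBlocks    = 3;     // active blocks per SM
constexpr int kRegsPerThread   = 24;
constexpr int kMaxThreadsPerSM = 1536;  // denominator of "1344 / 1536"
constexpr int kMaxWarpsPerSM   = 48;    // denominator of "42 / 48"
constexpr int kWarpSize        = 32;    // threads per warp on NVIDIA GPUs

constexpr int active_threads_per_sm() { return kBlockSize * kActiveBlocks; }
constexpr int active_warps_per_sm() { return active_threads_per_sm() / kWarpSize; }
constexpr int registers_in_use() { return active_threads_per_sm() * kRegsPerThread; }
constexpr double potential_occupancy() {
    return double(active_warps_per_sm()) / kMaxWarpsPerSM;
}

// Would this many blocks stay within the per-SM thread limit?
constexpr bool fits_on_sm(int blocks) { return blocks * kBlockSize <= kMaxThreadsPerSM; }

// Blocks needed if every one of n elements gets its own thread (assumption).
constexpr int blocks_to_cover(int n) { return (n + kBlockSize - 1) / kBlockSize; }

// Compare the derived values with the numbers the profiler printed.
void check_against_profiler() {
    assert(active_threads_per_sm() == 1344);  // "Active threads per SM = 1344 / 1536"
    assert(active_warps_per_sm() == 42);      // "Potential Occupancy ( 42 / 48 )"
    assert(registers_in_use() == 32256);      // "Register Ratio ( 32256 / 32768 )"
    assert(blocks_to_cover(10000) == 23);     // "Grid size: 23 x 1 x 1"
}
```

With 448 threads per block, a fourth block would need 1792 threads, more than the 1536 the SM allows, which matches the reported limiting factor of block size.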

Memory Throughput Analysis for kernel launch_closure_by_value on device GeForce GTX 470

Kernel requested global memory read throughput(GB/s): 29.21
Kernel requested global memory write throughput(GB/s): 17.52
Kernel requested global memory throughput(GB/s): 46.73
L1 cache read throughput(GB/s): 100.40
L1 cache global hit ratio (%): 48.15
Texture cache memory throughput(GB/s): 0.00
Texture cache hit rate(%): 0.00
L2 cache texture memory read throughput(GB/s): 0.00
L2 cache global memory read throughput(GB/s): 42.44
L2 cache global memory write throughput(GB/s): 46.73
L2 cache global memory throughput(GB/s): 89.17
L2 cache read hit ratio(%): 88.86
L2 cache write hit ratio(%): 3.09
Local memory bus traffic(%): 0.00
Global memory excess load(%): 31.18
Global memory excess store(%): 62.50
Achieved global memory read throughput(GB/s): 4.73
Achieved global memory write throughput(GB/s): 45.29
Achieved global memory throughput(GB/s): 50.01
Peak global memory throughput(GB/s): 133.92

How can I optimize this?

score 4 · Accepted Answer

If you want to modify a sequence in place with for_each, the functor needs to take its argument by reference:

struct normalize_functor
{
    __device__ __host__ void operator()(double4& ref)
    {
        double4 v = ref;
        double len = sqrt(v.x*v.x + v.y*v.y + v.z*v.z);
        v.x /= len;
        v.y /= len;
        v.z /= len;
        printf("%f %f %f\n", v.x, v.y, v.z);
        ref = v;
    }
};

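Alternatively, you could use your original definition of normalize_functor with the transform algorithm, specifying v_d as both the source and the destination range (this requires that operator() actually returns the normalized value):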

thrust::transform(v_d.begin(), v_d.end(), v_d.begin(), normalize_functor());

My personal preference is to use transform in this case, but the performance should be the same either way.
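
The difference between the variants can be sketched on the host with the standard library, whose for_each and transform treat by-value and by-reference arguments the same way. This is an illustration in plain C++, not Thrust code: Vec4 is a hypothetical stand-in for double4, and the functor names are made up for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

struct Vec4 { double x, y, z, w; };  // hypothetical host-side stand-in for double4

double length(const Vec4& v) {
    return std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
}

// Takes its argument by value: only a temporary copy is normalized.
struct normalize_copy {
    void operator()(Vec4 v) const {
        const double len = length(v);
        v.x /= len; v.y /= len; v.z /= len;  // changes are lost on return
    }
};

// Takes its argument by reference: the element in the container is modified.
struct normalize_in_place {
    void operator()(Vec4& v) const {
        const double len = length(v);
        v.x /= len; v.y /= len; v.z /= len;
    }
};

// Returns the normalized value: the shape that transform expects.
struct normalized {
    Vec4 operator()(Vec4 v) const {
        const double len = length(v);
        v.x /= len; v.y /= len; v.z /= len;
        return v;
    }
};
```

Passing normalize_copy to for_each leaves the container untouched, which is the symptom described in the question. Passing normalize_in_place to for_each, or normalized to transform with the same range as input and output, both produce unit-length vectors.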

score 1 · Accepted Answer

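On the question of optimization, there isn't much you can do with Thrust; that isn't really what the library is for. Nathan Bell, one of the authors of Thrust, has already posted in this thread, and without wanting to speak for him, my understanding is that the goal is to make data-parallel algorithms available without having to write much, if any, CUDA code. In my opinion it succeeds brilliantly at that. Many Thrust kernels perform close to the state of the art, but there will always be optimizations that can be made for specific cases and that are not easy to do in generic template code. That is part of the price you pay for the ease of use and flexibility that Thrust provides.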

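Having said that, I suspect there are a few tweaks you could try in your operator function that might improve things. I would usually write something like this: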

struct normalize_functor
{
    __device__ __host__ void operator()(double4& v)
    {
        double4 nv = v;
        double len = sqrt(nv.x*nv.x + nv.y*nv.y + nv.z*nv.z);
        nv.x /= len;
        nv.y /= len;
        nv.z /= len;
        (void)nv.w;  // reference w so that every member of the struct is used
        v = nv;
    }
};

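Now, while this isn't as pretty as the original, it should ensure that the compiler emits vectorized load and store instructions. I have seen cases in the past where the compiler optimizes away the loads and stores of unused members of vector types, which causes the PTX generator to emit scalar loads and stores and breaks coalescing as a result. Making the load and store of the whole vector type (here double4) explicit, and making sure every element of the structure is used, works around this unwanted "optimization", which was present in at least the 2.x and 3.x releases of nvcc. I am not sure whether the 4.0 compiler still behaves this way.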
