cuda - 纹理获取比直接全局访问慢，“Cuda by example”一书的第 7 章

Question

我正在阅读和测试“Cuda By example. An Introduction to General Purpose GPU Programming”一书中的示例。在测试第 7 章中的示例时，相对于纹理内存，我意识到通过纹理缓存访问全局内存比直接访问要慢得多（我的 NVIDIA GPU 是 GeForceGTX 260，计算能力 1.3，我使用的是 NVDIA CUDA 4.2）：

256*256 图像的纹理提取（1D 或 2D）的每帧时间：93 毫秒
256*256 不使用纹理（仅直接全局访问）的每帧时间：8.5 毫秒

我已经仔细检查了几次代码，我也一直在阅读 SDK 随附的“CUDA C 编程指南”和“CUDA C 最佳实践指南”，但我并不真正理解问题所在。据我了解，纹理内存只是具有特定访问机制实现的全局内存，使其看起来像缓存（？）。我想知道对全局内存的合并访问是否会使纹理获取速度变慢，但我不能确定。

有没有人有类似的问题？（我在 NVIDIA 论坛中找到了一些类似问题的链接，但该链接不再可用。）

测试代码看起来是这样的，只包括相关部分：

//#define TEXTURE
//#define TEXTURE2

#ifdef TEXTURE
// According to C programming guide, it should be static (3.2.10.1.1)
static texture<float> texConstSrc;
static texture<float> texIn;
static texture<float> texOut;
#endif

    __global__ void copy_const_kernel( float *iptr
    #ifdef TEXTURE2
     ){
    #else
        ,const float *cptr ) {
    #endif
            // map from threadIdx/BlockIdx to pixel position
            int x = threadIdx.x + blockIdx.x * blockDim.x;
            int y = threadIdx.y + blockIdx.y * blockDim.y;
            int offset = x + y * blockDim.x * gridDim.x;

    #ifdef TEXTURE2
            float c = tex1Dfetch(texConstSrc,offset);
    #else
            float c = cptr[offset];     
    #endif

            if ( c != 0) iptr[offset] = c;
    }

    __global__ void blend_kernel( float *outSrc,
    #ifdef TEXTURE
        bool dstOut ) {
    #else
        const float *inSrc ) {
    #endif
            // map from threadIdx/BlockIdx to pixel position
            int x = threadIdx.x + blockIdx.x * blockDim.x;
            int y = threadIdx.y + blockIdx.y * blockDim.y;
            int offset = x + y * blockDim.x * gridDim.x;
            int left = offset - 1;
            int right = offset + 1;
            if (x == 0) left++;
            if (x == SXRES-1) right--;
            int top = offset - SYRES;
            int bottom = offset + SYRES;
            if (y == 0) top += SYRES;
            if (y == SYRES-1) bottom -= SYRES;

    #ifdef TEXTURE
            float t, l, c, r, b;
            if (dstOut) {
                t = tex1Dfetch(texIn,top);
                l = tex1Dfetch(texIn,left);
                c = tex1Dfetch(texIn,offset);
                r = tex1Dfetch(texIn,right);
                b = tex1Dfetch(texIn,bottom);
            } else {
                t = tex1Dfetch(texOut,top);
                l = tex1Dfetch(texOut,left);
                c = tex1Dfetch(texOut,offset);
                r = tex1Dfetch(texOut,right);
                b = tex1Dfetch(texOut,bottom);
            }
            outSrc[offset] = c + SPEED * (t + b + r + l - 4 * c);
    #else
            outSrc[offset] = inSrc[offset] + SPEED * ( inSrc[top] +
                inSrc[bottom] + inSrc[left] + inSrc[right] -
                inSrc[offset]*4);
    #endif
    }

    // globals needed by the update routine
    struct DataBlock {
        unsigned char *output_bitmap;
        float *dev_inSrc;
        float *dev_outSrc;
        float *dev_constSrc;
        cudaEvent_t start, stop;
        float totalTime;
        float frames;
        unsigned size;
        unsigned char *output_host;
    };
    void anim_gpu( DataBlock *d, int ticks ) {
        checkCudaErrors( cudaEventRecord( d->start, 0 ) );
        dim3 blocks(SXRES/16,SYRES/16);
        dim3 threads(16,16);

    #ifdef TEXTURE
        volatile bool dstOut = true;
    #endif

        for (int i=0; i<90; i++) {
    #ifdef TEXTURE
            float *in, *out;
            if (dstOut) {
                in = d->dev_inSrc;
                out = d->dev_outSrc;
            } else {
                out = d->dev_inSrc;
                in = d->dev_outSrc;
            }
    #ifdef TEXTURE2
            copy_const_kernel<<<blocks,threads>>>( in );
    #else
            copy_const_kernel<<<blocks,threads>>>( in,
                d->dev_constSrc );
    #endif
            blend_kernel<<<blocks,threads>>>( out, dstOut );
            dstOut = !dstOut;

    #else
            copy_const_kernel<<<blocks,threads>>>( d->dev_inSrc,
                d->dev_constSrc );
            blend_kernel<<<blocks,threads>>>( d->dev_outSrc,
                d->dev_inSrc );
            swap( d->dev_inSrc, d->dev_outSrc );
    #endif
        }
            // Some stuff for the events
            // ...
         }

score 2 · Accepted Answer

我一直在使用 nvvp（NVIDIA 分析器）测试结果

结果非常奇怪，因为它们表明有很多纹理缓存未命中（这可能是性能不佳的原因）。分析器的结果还显示了即使使用指南“CUPTI_User_GUIde”也难以理解的信息：

text_cache_hit：纹理缓存命中数（根据 1.3 能力，它们仅占一个 SM）。
text_cache_miss：纹理缓存未命中数（根据 1.3 能力仅占一个 SM）。

以下是不使用纹理缓存的 256*256 示例的结果（仅显示相关信息）：

名称持续时间(ns) Grid_Size Block_Size

“copy_const_kernel(...) 22688 16,16,1 16,16,1

“混合内核（...）” 51360 16,16,1 16,16,1

以下是使用 1D 纹理缓存的结果：

名称持续时间(ns) Grid_Size Block_Size tex_cache_hit tex_cache_miss

“copy_const_kernel(...)”147392 16,16,1 16,16,1 0 1024

“混合内核（...）” 841728 16,16,1 16,16,1 79 5041

以下是使用 2D 纹理缓存的结果：

名称持续时间(ns) Grid_Size Block_Size tex_cache_hit tex_cache_miss

“copy_const_kernel(...)”150880 16,16,1 16,16,1 0 1024

“混合内核（...）” 872832 16,16,1 16,16,1 2971 2149

这些结果显示了几个有趣的信息：

“copy const”函数根本没有缓存命中（尽管理想情况下内存是“空间定位的”，因为每个线程访问的内存靠近其他近线程访问的内存）。我猜这是因为这个函数中的线程不会从其他线程访问内存，这似乎是纹理缓存可用的方式（“空间定位”概念相当混乱）
对于函数“blend_kernel”，1D 中有一些缓存命中，而在 2D 情况下还有更多。我猜这是由于在该函数中，任何线程都从其邻居线程访问内存。我不明白为什么 2D 比 1d 多。
有纹理情况下的持续时间比无纹理情况下的要长（几乎大约一个数量级）。也许与如此多的纹理缓存未命中有关。
对于“copy_const”函数，SM 总共有 1024 次访问，而“混合内核”则有 5120 次。关系 5:1 是正确的，因为“blend”中有 5 次提取，而“copy_const”中只有 1 次。无论如何，我无法理解所有这些 1024 来自哪里：理想情况下，这个事件“文本缓存未命中/热”只占一个 SM（我的 GeForceGTX 260 中有 24 个），它只占扭曲（32 个线程大小）。因此，我每个 SM 有 256 个线程/32=8 个扭曲，每个 SM 有 256 个块/24 = 10 或 11 个“迭代”，所以我预计会有 80 或 88 次获取（此外，还有一些其他事件，如 sm_cta_launched，其中是每个 SM 的线程块数，应该在我的 1.3 设备中支持，始终为 0...)

cuda - 纹理获取比直接全局访问慢，“Cuda by example”一书的第 7 章

1 回答 1

Related

Reference