
EDIT: new minimal working example to illustrate the question and a better explanation of nvvp's output (following suggestions given in the comments).

So, I have crafted a "minimal" working example, which follows:

#include <cuComplex.h>
#include <iostream>

int const n = 512 * 100;

typedef float real;

template < class T >
struct my_complex {
   T x;
   T y;
};

__global__ void set( my_complex< real > * a )
{
   my_complex< real > & d = a[ blockIdx.x * 1024 + threadIdx.x ];
   d = { 1.0f, 0.0f };
}

__global__ void duplicate_whole( my_complex< real > * a )
{
   my_complex< real > & d = a[ blockIdx.x * 1024 + threadIdx.x ];
   d = { 2.0f * d.x, 2.0f * d.y };
}

__global__ void duplicate_half( real * a )
{
   real & d = a[ blockIdx.x * 1024 + threadIdx.x ];
   d *= 2.0f;
}

int main()
{
   my_complex< real > * a;
   cudaMalloc( ( void * * ) & a, sizeof( my_complex< real > ) * n * 1024 );

   set<<< n, 1024 >>>( a );
   cudaDeviceSynchronize();
   duplicate_whole<<< n, 1024 >>>( a );
   cudaDeviceSynchronize();
   duplicate_half<<< 2 * n, 1024 >>>( reinterpret_cast< real * >( a ) );
   cudaDeviceSynchronize();

   my_complex< real > * a_h = new my_complex< real >[ n * 1024 ];
   cudaMemcpy( a_h, a, sizeof( my_complex< real > ) * n * 1024, cudaMemcpyDeviceToHost );

   std::cout << "( " << a_h[ 0 ].x << ", " << a_h[ 0 ].y << " )" << '\t' << "( " << a_h[ n * 1024 - 1 ].x << ", " << a_h[ n * 1024 - 1 ].y << " )"  << std::endl;

   return 0;
}

When I compile and run the above code, kernels duplicate_whole and duplicate_half take just about the same time to run.

However, when I analyze the kernels using nvvp I get a different report for each kernel, in the following sense. For kernel duplicate_whole, nvvp warns me that at line 23 (d = { 2.0f * d.x, 2.0f * d.y };) the kernel is performing

Global Load L2 Transaction/Access = 8, Ideal Transaction/Access = 4

I agree that I am loading 8-byte words. What I do not understand is why 4 bytes is considered the ideal word size. In particular, there is no performance difference between the kernels.

I suppose that there must be circumstances where this global store access pattern could cause performance degradation. What are these?

And why is it that I do not get a performance hit?

I hope that this edit has clarified some unclear points.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

I'll start with some kernel code to exemplify my question, which follows below:

template < class data_t >
__global__ void chirp_factors_multiply( std::complex< data_t > const * chirp_factors,
                                        std::complex< data_t > * data,
                                        int M,
                                        int row_length,
                                        int b,
                                        int i_0
                                        )
{
#ifndef CUGALE_MUL_SHUFFLE
    // Output array length:
    int plane_area = row_length * M;
    // Process element:
    int i = blockIdx.x * row_length + threadIdx.x + i_0;
    my_complex< data_t > const chirp_factor = ref_complex( chirp_factors[ i ] );
    my_complex< data_t > datum;
    my_complex< data_t > datum_new;

    for ( int i_b = 0; i_b < b; ++ i_b )
    {
        my_complex< data_t > & ref_datum = ref_complex( data[ i_b * plane_area + i ] );
        datum = ref_datum;
        datum_new.x = datum.x * chirp_factor.x - datum.y * chirp_factor.y;
        datum_new.y = datum.x * chirp_factor.y + datum.y * chirp_factor.x;
        ref_datum = datum_new;
    }
#else
    // Output array length:
    int plane_area = row_length * M;
    // Element to process:
    int i = blockIdx.x * row_length + ( threadIdx.x + i_0 ) / 2;
    my_complex< data_t > const chirp_factor = ref_complex( chirp_factors[ i ] );

    // Real and imaginary part of datum (not respectively for odd threads):
    data_t datum_a;
    data_t datum_b;

    // Even TIDs will read data in regular order, odd TIDs will read data in inverted order:
    int parity = ( threadIdx.x % 2 );
    int shuffle_dir = 1 - 2 * parity;
    int inwarp_tid = threadIdx.x % warpSize;

    for ( int i_b = 0; i_b < b; ++ i_b )
    {
        int data_idx = i_b * plane_area + i;
        datum_a = reinterpret_cast< data_t * >( data + data_idx )[ parity ];
        datum_b = __shfl_sync( 0xFFFFFFFF, datum_a, inwarp_tid + shuffle_dir, warpSize );

        // Even TIDs compute real part, odd TIDs compute imaginary part:
        reinterpret_cast< data_t * >( data + data_idx )[ parity ] = datum_a * chirp_factor.x - shuffle_dir * datum_b * chirp_factor.y;
    }
#endif // #ifndef CUGALE_MUL_SHUFFLE
}

Let us consider the case where data_t is float, which is memory bandwidth limited. As can be seen above, there are two versions of the kernel: one which reads/writes 8 bytes (a whole complex number) per thread, and another which reads/writes 4 bytes per thread and then shuffles the results so the complex product is computed correctly.

The reason I wrote the shuffle version is that nvvp insisted that reading 8 bytes per thread was not the best idea, because this memory access pattern would be inefficient. This is the case even though on both systems tested (GTX 1050 and GTX Titan Xp) memory bandwidth was very close to the theoretical maximum.

Admittedly, I knew that no improvement was likely to happen, and this was indeed the case: both kernels take pretty much the same time to run. So, my question is the following:

Why does nvvp report that reading 8 bytes per thread would be less efficient than reading 4 bytes per thread? Under which circumstances would that be the case?

As a side note, single precision is more important to me, but double is useful in some cases too. Interestingly enough, in the case where data_t is double there is no execution time difference between the two kernel versions either, even though in this case the kernel is compute bound and the shuffle version performs a few more flops than the original version.

Note: the kernels are applied to a row_length * M * b dataset (b images with row_length columns and M rows), and the chirp_factor array is row_length * M. Both kernels run perfectly fine (I can edit the question to show you the calls to both versions if you have doubts about it).

1 Answer

The issue here has to do with how the compiler handles your code. nvvp is merely dutifully reporting what happens when you run the code.

If you use the cuobjdump -sass tool on your executable, you will find that the duplicate_whole routine is performing two 4-byte loads and two 4-byte stores. This is not optimal, partly because there is a stride in each load and store (each load and store touches alternate elements in memory).
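
To make that concrete, here is a rough source-level sketch of the access pattern the compiler falls back to in that case (my own illustration of the strided pattern, with a hypothetical kernel name; the actual output is SASS, not CUDA C):

__global__ void duplicate_whole_strided( float * a )
{
   int i = blockIdx.x * 1024 + threadIdx.x;
   float x = a[ 2 * i ];            // 4-byte load with a stride of two floats across the warp
   float y = a[ 2 * i + 1 ];        // second strided 4-byte load
   a[ 2 * i ]     = 2.0f * x;       // strided 4-byte store
   a[ 2 * i + 1 ] = 2.0f * y;       // second strided 4-byte store
}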

The reason for this is that the compiler does not know the alignment of your my_complex struct. Your struct would be legal to use in situations that prevent the compiler from generating a (legal) 8-byte load. As discussed here, we can fix this by informing the compiler that we only intend to use the struct in alignment scenarios where a CUDA 8-byte load is legal (i.e. it is "naturally aligned"). The modification to the struct looks like this:

template < class T >
struct __align__(8) my_complex {
   T x;
   T y;
};

With that change to the code, the compiler generates 8-byte loads for the duplicate_whole kernel, and you should see a different report from the profiler. You should use this kind of decoration only when you understand what it means and are willing to enter into a contract with the compiler to ensure that this is the case. If you do something unusual, like unusual pointer casting, you may violate your end of the bargain and generate a machine fault.
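
As an illustration of the kind of cast that would violate that contract, here is a hedged sketch (the kernel name and setup are my own, not part of the answer): once my_complex is declared __align__(8), reinterpreting a pointer that is only 4-byte aligned as my_complex * and dereferencing it requests an 8-byte load from a misaligned address, which can abort the kernel with a misaligned address error.

__global__ void break_alignment_contract( float * buf, my_complex< float > * out )
{
   // buf returned by cudaMalloc is heavily aligned, but buf + 1 is only
   // 4-byte aligned; the cast below therefore breaks the __align__(8) promise.
   my_complex< float > * p = reinterpret_cast< my_complex< float > * >( buf + 1 );
   out[ threadIdx.x ] = p[ threadIdx.x ];   // 8-byte load from a misaligned address
}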

The reason you don't see much performance difference almost certainly has to do with CUDA load/store behavior and the GPU caches.

When you do a strided load, the GPU loads an entire cache line anyway, even though (in this case) you only need half of the elements (the real elements) for that particular load operation. However, you need the other half of the elements (the imaginary elements) anyway; they will be loaded on the next instruction, and that instruction will most likely hit in the cache because of the previous load.
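
As a quick sanity check of the 8-versus-4 transactions-per-access figure reported by nvvp (my own back-of-the-envelope arithmetic, assuming 32 threads per warp and 32-byte L2 transactions):

#include <iostream>

int main()
{
   int const warp_size   = 32;
   int const transaction = 32;   // bytes per L2 transaction

   // Ideal case: 32 contiguous 4-byte accesses cover 128 bytes -> 4 transactions.
   std::cout << "ideal:   " << warp_size * 4 / transaction << " transactions/access\n";

   // Strided case: 4-byte accesses to alternate halves of 8-byte complexes
   // spread the warp over 256 bytes -> 8 transactions, half of each unused.
   std::cout << "strided: " << warp_size * 8 / transaction << " transactions/access\n";

   return 0;
}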

On a strided store in this case, writing the strided elements in one instruction and the alternate elements in the next instruction ends up using one of the caches as a "coalescing buffer". This is not coalescing in the typical sense used in CUDA terminology; that kind of coalescing only applies to a single instruction. However, the cache "coalescing buffer" behavior allows it to "accumulate" multiple writes to an already-resident line before that line is written out or evicted. This is approximately equivalent to "write-back" cache behavior.
