c++ - Shortage of registers using SSE intrinsics

Question

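In the post "SSE load/store memory transactions" I asked about the difference between explicit register-memory transactions and intermediate pointers. In practice, intermediate pointers showed slightly higher performance. However, it is not clear to me what an intermediate pointer is in hardware terms. If a pointer is created, does that mean some registers are occupied as well, or is a register only used while an SSE operation (e.g. _mm_mul_ps) executes?

Let's consider this example: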

#include <xmmintrin.h>  // SSE intrinsics; GCC and Clang also expose _mm_malloc/_mm_free here

struct sse_simple
{
    sse_simple(unsigned int InputLength):
        // Initializers are listed in member declaration order.
        // Buffer size: InputLength floats (assumed to be a multiple of 4), 16-byte aligned.
        input1((float*)_mm_malloc(InputLength*sizeof(float), 16)),
        input2((float*)_mm_malloc(InputLength*sizeof(float), 16)),
        output((float*)_mm_malloc(InputLength*sizeof(float), 16)),
        inp1_sse(reinterpret_cast<__m128*>(input1)),
        inp2_sse(reinterpret_cast<__m128*>(input2)),
        output_sse(reinterpret_cast<__m128*>(output)),
        Len(InputLength/4)  // number of 4-float vectors
    {}

    ~sse_simple()
    {
        _mm_free(input1);
        _mm_free(input2);
        _mm_free(output);
    }

    void func()
    {
        // Multiply the two inputs one 4-float vector at a time.
        for(unsigned int i=0; i<Len; ++i)
            output_sse[i] = _mm_mul_ps(inp1_sse[i], inp2_sse[i]);
    }

    float *input1;
    float *input2;
    float *output; 

    __m128 *inp1_sse;
    __m128 *inp2_sse;
    __m128 *output_sse;

    unsigned int Len;
};

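In the example above, the intermediate pointers inp1_sse, inp2_sse and output_sse are created once, in the constructor. If I create a large number of sse_simple objects (e.g. 50,000 or more), will this cause a shortage of registers?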

score 2 · Accepted Answer

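First of all, registers are small storage locations that sit close to the execution units, which makes access to them very fast. The compiler tries to use them as much as possible to speed up computation, and falls back to memory when it cannot. Because registers can hold only a small amount of data, they are normally used only as temporary storage during a computation. Most of the time everything ends up stored in memory, apart from temporaries such as loop indices. So a shortage of registers only slows the computation down.

During a computation, pointers are held in general-purpose registers (GPRs), whether they point to a float, a vector or anything else, while __m128 vectors are held in dedicated registers (the XMM registers).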
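To make the distinction concrete, here is a minimal sketch of what such pointer members amount to in storage terms. The names sse_view and footprint_bytes are hypothetical and not part of the question's code; sse_view is a non-owning stand-in for the three pointer members plus the length. The point it illustrates is that the objects themselves are ordinary data in memory, so having many of them increases memory use, not register use.

```cpp
#include <xmmintrin.h>  // __m128
#include <cstddef>
#include <vector>

// Hypothetical, non-owning stand-in for the pointer members of sse_simple.
struct sse_view {
    __m128* inp1_sse;
    __m128* inp2_sse;
    __m128* output_sse;
    unsigned int Len;
};

// A __m128* is an ordinary pointer; only the value it points to is 16 bytes wide.
static_assert(sizeof(__m128*) == sizeof(float*), "vector pointers are plain pointers");
static_assert(sizeof(__m128) == 4 * sizeof(float), "one __m128 holds four floats");

// Bytes of memory occupied by `count` objects. The objects live in the
// vector's heap buffer; no register is reserved for any of them.
std::size_t footprint_bytes(std::size_t count) {
    std::vector<sse_view> views(count);  // value-initialized: null pointers, Len == 0
    return views.size() * sizeof(sse_view);
}
```

Calling footprint_bytes(50000) returns 50000 * sizeof(sse_view). A pointer member is only copied into a general-purpose register while code that dereferences it is running.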

So in your example, the three arrays are stored in memory, and the line

output_sse[i] = _mm_mul_ps(inp1_sse[i], inp2_sse[i]);

compiles to something like:

movaps -0x30(%rbp),%xmm0    # load inp1_sse[i] into register %xmm0
movaps -0x20(%rbp),%xmm1    # load inp2_sse[i] into register %xmm1
mulps  %xmm1,%xmm0          # multiply; the result is left in %xmm0
movaps %xmm0,(%rdx)         # store the result to memory

As you can see, the pointers use the registers %rbp and %rdx.
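For comparison, the two styles mentioned in the question can be written side by side. This is a sketch only: the function names mul_explicit and mul_pointer are hypothetical, and both assume 16-byte-aligned buffers holding n_vec vectors of four floats each. Both express the same load, multiply, store sequence, and in both the pointers are just addresses held in general-purpose registers while the loop runs.

```cpp
#include <xmmintrin.h>  // _mm_load_ps, _mm_store_ps, _mm_mul_ps
#include <cstddef>

// Variant A: explicit load/store intrinsics on float pointers.
// a, b and out must be 16-byte aligned.
void mul_explicit(const float* a, const float* b, float* out, std::size_t n_vec) {
    for (std::size_t i = 0; i < n_vec; ++i) {
        __m128 va = _mm_load_ps(a + 4 * i);   // aligned load of four floats
        __m128 vb = _mm_load_ps(b + 4 * i);
        _mm_store_ps(out + 4 * i, _mm_mul_ps(va, vb));
    }
}

// Variant B: "intermediate" __m128 pointers; dereferencing them makes the
// compiler generate the vector loads and stores.
void mul_pointer(const __m128* a, const __m128* b, __m128* out, std::size_t n_vec) {
    for (std::size_t i = 0; i < n_vec; ++i)
        out[i] = _mm_mul_ps(a[i], b[i]);
}
```

Both functions produce identical results for the same input. Any performance difference between them comes from the code the compiler generates for each, not from the __m128 pointers holding on to registers.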
