c++ - OS-portable memcpy optimized for SSE2 and SSE3

Question

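If I were to write an OS-portable memcpy optimized for SSE2/SSE3, what would it look like? I want to support both the GCC and ICC compilers. The reason I ask is that memcpy in glibc is written in assembly and is not optimized for SSE2/SSE3, and other generic memcpy implementations may not take full advantage of system capabilities with regard to data alignment, size, and so on.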

Here is my current memcpy, which takes data alignment into account and is optimized for SSE2 (I think), but not for SSE3:

#ifdef __SSE2__
// SSE2 optimized memcpy()
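void *CMemUtils::MemCpy(void *__restrict b, const void *__restrict a, size_t n) const
{
    // C++ has no implicit conversion from void*, so cast explicitly.
    // __restrict is the GCC/ICC spelling; plain `restrict` is C99 only.
    char *s1 = static_cast<char *>(b);
    const char *s2 = static_cast<const char *>(a);
    // Plain byte loop: no SSE2 intrinsics are used, so vectorization is left to the compiler.
    for (; 0 < n; --n) *s1++ = *s2++;
    return b;
}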
#else
// Generic memcpy() implementation
void *CMemUtils::MemCpy(void *dest, const void *source, size_t count) const
{
#ifdef _USE_SYSTEM_MEMCPY
    // Use system memcpy()
    return memcpy(dest, source, count);
#else

    size_t blockIdx;
    size_t blocks = count >> 3;
    size_t bytesLeft = count - (blocks << 3);

    // Copy 64-bit blocks first
    _UINT64 *sourcePtr8 = (_UINT64*)source;
    _UINT64 *destPtr8 = (_UINT64*)dest;
    for (blockIdx = 0; blockIdx < blocks; blockIdx++) destPtr8[blockIdx] = sourcePtr8[blockIdx];

    if (!bytesLeft) return dest;

    blocks = bytesLeft >> 2;
    bytesLeft = bytesLeft - (blocks << 2);

    // Copy 32-bit blocks
    _UINT32 *sourcePtr4 = (_UINT32*)&sourcePtr8[blockIdx];
    _UINT32 *destPtr4 = (_UINT32*)&destPtr8[blockIdx];
    for (blockIdx = 0; blockIdx < blocks; blockIdx++) destPtr4[blockIdx] = sourcePtr4[blockIdx];

    if (!bytesLeft) return dest;

    blocks = bytesLeft >> 1;
    bytesLeft = bytesLeft - (blocks << 1);

    // Copy 16-bit blocks
    _UINT16 *sourcePtr2 = (_UINT16*)&sourcePtr4[blockIdx];
    _UINT16 *destPtr2 = (_UINT16*)&destPtr4[blockIdx];
    for (blockIdx = 0; blockIdx < blocks; blockIdx++) destPtr2[blockIdx] = sourcePtr2[blockIdx];

    if (!bytesLeft) return dest;

    // Copy byte blocks
    _UINT8 *sourcePtr1 = (_UINT8*)&sourcePtr2[blockIdx];
    _UINT8 *destPtr1 = (_UINT8*)&destPtr2[blockIdx];
    for (blockIdx = 0; blockIdx < bytesLeft; blockIdx++) destPtr1[blockIdx] = sourcePtr1[blockIdx];
    return dest;
#endif
}
#endif

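Not all memcpy implementations are thread-safe, which is just another reason to make our own version. All of this led me to the conclusion that I should at least try to write a thread-safe, OS-portable memcpy that is optimized for SSE2/SSE3 where available.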
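For reference, a minimal sketch of what an SSE2 copy path could look like is given here. It is not taken from CMemUtils: the name `sse2_copy` is hypothetical, it assumes the two regions do not overlap (as with memcpy), and it falls back to `std::memcpy` when `__SSE2__` is not defined. It aligns the destination first, then uses unaligned loads and aligned stores.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#ifdef __SSE2__
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Hypothetical helper (assumption, not part of the original code):
// copies n bytes from src to dest; the regions must not overlap.
void *sse2_copy(void *dest, const void *src, std::size_t n)
{
    unsigned char *d = static_cast<unsigned char *>(dest);
    const unsigned char *s = static_cast<const unsigned char *>(src);

#ifdef __SSE2__
    // Copy leading bytes until the destination is 16-byte aligned.
    std::size_t head = (16 - (reinterpret_cast<std::uintptr_t>(d) & 15)) & 15;
    if (head > n) head = n;
    std::memcpy(d, s, head);
    d += head;
    s += head;
    n -= head;

    // Main loop: unaligned 16-byte load, aligned 16-byte store.
    for (; n >= 16; n -= 16, d += 16, s += 16) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i *>(s));
        _mm_store_si128(reinterpret_cast<__m128i *>(d), v);
    }
#endif

    // Copy the remaining tail (or everything, if SSE2 is unavailable).
    std::memcpy(d, s, n);
    return dest;
}
```

The function keeps no shared state, so concurrent calls on distinct buffers do not interfere with each other. Whether it beats the library memcpy would have to be measured on the target machine.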

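I have also read that GCC supports aggressive loop unrolling with the -funroll-loops compiler option. Could this improve performance with SSE2 and/or SSE3, provided there are no significant cache misses?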

Would making different memcpy versions for 32-bit and 64-bit architectures improve performance?

Is there any performance gain from pre-aligning internal memory buffers before copying?
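To make the pre-alignment idea concrete, here is a minimal sketch assuming the internal buffers have a fixed size and can be declared with `alignas(16)`. The names `AlignedBlock`, `is_aligned16`, and `copy_block` are hypothetical. With both sides known to be 16-byte aligned, the aligned SSE2 load and store forms can be used without a head or tail loop.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#ifdef __SSE2__
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Hypothetical helper: true if p sits on a 16-byte boundary.
inline bool is_aligned16(const void *p)
{
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// Hypothetical internal buffer, aligned for 16-byte SSE loads and stores.
// The size (256) is an arbitrary multiple of 16 chosen for illustration.
struct alignas(16) AlignedBlock {
    unsigned char bytes[256];
};

// Copies one block to another using aligned accesses on both sides.
inline void copy_block(AlignedBlock &dst, const AlignedBlock &src)
{
#ifdef __SSE2__
    for (std::size_t i = 0; i < sizeof(src.bytes); i += 16) {
        __m128i v = _mm_load_si128(reinterpret_cast<const __m128i *>(src.bytes + i));
        _mm_store_si128(reinterpret_cast<__m128i *>(dst.bytes + i), v);
    }
#else
    // Fallback when SSE2 is not enabled at compile time.
    std::memcpy(dst.bytes, src.bytes, sizeof(src.bytes));
#endif
}
```

Any gain from this depends on the CPU and on buffer sizes, so it is something to benchmark rather than assume.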

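How do I use #pragma loop to control how the auto-parallelizer for SSE2/SSE3 treats the loop code? Supposedly #pragma loop can be applied to contiguous data regions that are moved by a for() loop.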

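When adding my own memcpy, do I need the GCC compiler option -fno-builtin-memcpy, even when compiling with -O3, to keep the compiler from inlining GCC's built-in memcpy? Or is it enough to just override memcpy in my code?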

Update: After some testing, it seems to me that an SSE2-optimized memcpy() is not so much faster that it is worth the effort. I have asked about this on the Intel C/C++ Compiler forums.
