performance - 在 D 语言中比较两个内存块的最有效方法是什么？

Question

我需要一个内存块的比较函数，以便在 D 编程语言中对字节数组进行二进制搜索。它不需要任何有用的语义。它只需要快速并且是有效的比较函数（产生总排序的函数）。已知要比较的内存块具有相同的长度。

Cmemcmp实际上很慢，因为它试图保留有用的字符串比较语义，而我不需要。以下是迄今为止我想出的最好的。有没有人知道更好的东西，最好不要浸入非便携式 CPU 特定指令？

// Faster than C's memcmp because it doesn't preserve any meaningful
// semantics.  It's just a completely arbitrary, but really fast,
// comparison function.
int memoryCompare(const(void)* lhs, const(void)* rhs, size_t n) {
    for(; n >= uint.sizeof; n -= uint.sizeof) {
        if( *(cast(uint*) lhs) < *(cast(uint*) rhs)) {
            return -1;
        } else if( *(cast(uint*) lhs) > *(cast(uint*) rhs)) {
            return 1;
        }
        lhs += uint.sizeof;
        rhs += uint.sizeof;
    }

    for(; n >= ubyte.sizeof; n -= ubyte.sizeof) {
        if( *(cast(ubyte*) lhs) < *(cast(ubyte*) rhs)) {
            return -1;
        } else if( *(cast(ubyte*) lhs) > *(cast(ubyte*) rhs)) {
            return 1;
        }
        lhs += ubyte.sizeof;
        rhs += ubyte.sizeof;
    }

    return 0;
}

编辑：我已经阅读了 SSE，我不想使用它有 3 个原因：

它不是便携式的。
它需要在 ASM 中编程。
比较说明假定您的数据是浮点数，如果某些数据恰好与 NaN 的模式匹配，这可能会出现问题。

score 3 · Accepted Answer

你可以试试：

检查 uint 是否是适合您的目标 CPU 的最大类型（ulongs 可能更适合本机寄存器）
使用 2 个该类型的指针
使用 *p++ 使用 2 个局部变量（不要为 1 个值取消引用指针 2 次）
预先分配第一个循环的计数器（使用 while (counter--)）
通过用开关替换第二个循环来展开第二个循环（如果 sizeof（适合寄存器的类型）已知并且将始终相同。）

编辑：如果第一个循环是瓶颈，展开可能是答案。结合在相等值的情况下将条件数量减半，展开 4 次我得到类似：

uint* lp = (uint*)lhs;
uint* rp = (uint*)rhs;
uint  l;
uint  r;
int count = (n / uint.sizeof) / 4;

while (count--) {
    if( (l = *lp++) != (r = *rp++) {
        return (l < r) ? -1 : 1;
    }
    if( (l = *lp++) != (r = *rp++) {
        return (l < r) ? -1 : 1;
    }
    if( (l = *lp++) != (r = *rp++) {
        return (l < r) ? -1 : 1;
    }
    if( (l = *lp++) != (r = *rp++) {
        return (l < r) ? -1 : 1;
    }
}

当然，(n / uint.sizeof) % 4剩下的迭代要做，你可以通过交错一个开关来混合到这个循环中，我把它作为练习留给读者邪恶的笑容。

score 2 · Accepted Answer

我对此了解不多，但有一些向量指令可以一次将指令应用于多个字节。你可以使用这些结果来做一个非常快速的 memcmp。我不知道您需要哪些说明，但如果您查找 Larrabee 新说明或查看这篇文章，您可能会找到您正在寻找的内容http://www.ddj.com/architect/216402188

注意：这个 CPU 不是 ATM AFAIK

-编辑-现在我很肯定有一个指令集（尝试查看 SSE 或 SSE2），如果它们对齐，可以一次比较 16 个字节。

你可以试试这个纯 c++ 代码。

template<class T>
int memoryCompare(const T* lhs, const T * rhs, size_t n) {
    const T* endLHS = lhs + n/sizeof(T);
    while(lhs<endLHS) {
        int i = *lhs - *rhs;
        if(i != 0) return i > 0 ? 1 : -1;
        lhs++;
        rhs++;
    }
    //more code for the remaining bytes or call memoryCompare<char>(lhs, rhs, n%sizeof(T));
    return 0;
}

这里的优点是您增加了指针，因此您可以取消引用它而不应用索引（它的 ptr_offset [index] vs ptr_offset）。以上使用模板，因此您可以在 64 位机器上使用 64 位。组装中的 CMP 实际上只是减去检查 N 和 Z 标志。而不是比较 N 和减少 N 我只是在我的版本中进行比较。

score 1 · Accepted Answer

如果您相信您的编译器的优化，您可以尝试对 acidzombie24s 的建议进行一些修改：

template<class T> int memoryCompare(const T* lhs, const T * rhs, size_t n) {
    const T* endLHS = &lhs[n];
    while(lhs<endLHS) {
        //A good optimiser will keep these values in register
        //and may even be clever enough to just retest the flags
        //before incrementing the pointers iff we loop again.
        //gcc -O3 did the optimisation very well.
        if (*lhs > *rhs) return 1;
        if (*lhs++ < *rhs++) return -1;
    }
    //more code for the remaining bytes or call memoryCompare<char>(lhs, rhs, n%sizeof(T));
    return 0;
}

这是上述 C 版本的 x86 汇编代码中的整个 gcc -O3 优化内部循环，仅传递 char 数组指针：

Loop:
        incl    %eax         ; %eax is lhs
        incl    %edx         ; %edx is rhs
        cmpl    %eax, %ebx   ; %ebx is endLHS
        jbe     ReturnEq
        movb    (%edx), %cl
        cmpb    %cl, (%eax)
        jg      ReturnGT
        jge     Loop
ReturnLT:
        ...

score 1 · Accepted Answer

我认为 memcmp 被指定为进行逐字节比较，而不管数据类型如何。您确定您的编译器的实现保留了字符串语义吗？它不应该。

score 1 · Accepted Answer

这个问题的答案对你有帮助吗？

如果编译器支持将 memcmp() 实现为内在/内置函数，那么您似乎很难超越它。

我不得不承认对 D 几乎一无所知，所以我不知道 D 编译器是否支持内在函数。

score 1 · Accepted Answer

很大程度上取决于您的系统和数据。我们只能做出这么多的假设。你用的是什么处理器？它必须是直接的C代码吗？CPU寄存器的宽度是多少？CPU的缓存结构是怎样的？等等等等

这还取决于您的数据有多么不同。如果每个缓冲区的第一个字节不太可能相同，那么函数的速度就毫无意义，因为从逻辑上讲，它不会到达函数的其余部分。如果前 n-1 个字节可能通常是 sme，那么它就变得更加重要。

无论您如何进行测试，您都不太可能看到太大的变化。

无论如何，这是我自己的一个小实现，它可能会，也可能不会，比你自己的更快（或者，鉴于我只是在我前进的过程中编造出来，它可能会，也可能不会，工作;））：

int memoryCompare(const void* lhs, const void* rhs, size_t n) 
{
    uint_64 diff    = 0

    // Test the first few bytes until we are 32-bit aligned.
    while( (n & 0x3) != 0 && diff != 0 )
    {
        diff = (uint_8*)lhs - (uint_8*)rhs;
        n--;
        ((uint_8*)lhs)++;
        ((uint_8*)rhs)++;              
        }

    // Test the next set of 32-bit integers using comparisons with
    // aligned data.
    while( n > sizeof( uint_32 ) && diff != 0 )
    {
        diff = (uint_32*)lhs - (uint_32*)rhs;
        n   -= sizeof( uint_32 );
        ((uint_32*)lhs)++;
        ((uint_32*)rhs)++;
    }

    // now do final bytes.
    while( n > 0 && diff != 0 ) 
        {
        diff = (uint_8*)lhs - (uint_8*)rhs;
        n--;
        ((uint_8*)lhs)++;
        ((uint_8*)rhs)++;  
        }

    return (int)*diff / abs( diff ));
}

performance - 在 D 语言中比较两个内存块的最有效方法是什么？

6 回答 6

Related

Reference