4

我需要一些帮助来优化我的程序中计算量最大的功能。目前,我发现基本(非 SSE)版本明显更快(高达 3 倍)。因此,我请求您帮助纠正这个问题。

该函数在无符号整数向量中查找子集,并报告它们是否存在。为了您的方便,我只包含了相关的代码片段。

首先是基本变体。它检查是否blocks_x.blocks_. (不完全相等。)这些是位图,也就是位向量或位集。

//Check for self comparison
    if (this == &x)
            return false;

//A subset is equal to or smaller.
    if (no_bits_ > x.no_bits_)
            return false;

    int i;

    bool equal = false;

//Pointers should not change.  
    const unsigned int *tptr = blocks_;
    const unsigned int *xptr = x.blocks_;

    
    for (i = 0; i < no_blocks_; i++, tptr++, xptr++) {
            if ((*tptr & *xptr) != *tptr)
                    return false;
            if (*tptr != *xptr)
                    equal = true;
    }

    return equal;

然后是 SSE 变体,可惜它没有按照我的预期执行。这两个片段都应该寻找相同的东西。

    //starting pointers.        
    const __m128i* start = (__m128i*)&blocks_;
    const __m128i* xstart = (__m128i*)&x.blocks_;
    
    __m128i block;
    __m128i xblock;

    //Unsigned ints are 32 bits, meaning 4 can fit in a register.
    for (i = 0; i < no_blocks_; i+=4) {

            block = _mm_load_si128(start + i);
            xblock = _mm_load_si128(xstart + i);

                    //Equivalent to  (block & xblock) != block
                    if (_mm_movemask_epi8(_mm_cmpeq_epi32(_mm_and_si128(block, xblock), block)) != 0xffff)
                            return false;

                    //Equivalent to block != xblock
                    if (_mm_movemask_epi8(_mm_cmpeq_epi32(block, xblock)) != 0xffff)
                            equal = true;
    }
    return equal;

您对我如何改进 SSE 版本的性能有什么建议吗?难道我做错了什么?或者这是应该在其他地方进行优化的情况?

我还没有在剩余的计算中添加no_blocks_ % 4 != 0,但是在性能提高之前这样做几乎没有什么目的,而且此时只会弄乱代码。

4

2 回答 2

4

There are three possibilities I see here.

First, your data might not suit wide comparisons. If there's a high chance that (*tptr & *xptr) != *tptr within the first few blocks, the plain C++ version will almost certainly always be faster. In that instance, your SSE will run through more code & data to accomplish the same thing.

Second, your SSE code may be incorrect. It's not totally clear here. If no_blocks_ is identical between the two samples, then start + i is probably having the unwanted behavior of indexing into 128-bit elements, not 32-bit as the first sample.

Third, SSE really likes it when instructions can be pipelined, and this is such a short loop that you might not be getting that. You can reduce branching significantly here by processing more than one SSE block at once.

Here's a quick untested shot at processing 2 SSE blocks at once. Note I've removed the block != xblock branch entirely by keeping the state outside of the loop and only testing at the end. In total, this moves things from 1.3 branches per int to 0.25.

bool equal(unsigned const *a, unsigned const *b, unsigned count)
{
    __m128i eq1 = _mm_setzero_si128();
    __m128i eq2 = _mm_setzero_si128();

    for (unsigned i = 0; i != count; i += 8)
    {
        __m128i xa1 = _mm_load_si128((__m128i const*)(a + i));
        __m128i xb1 = _mm_load_si128((__m128i const*)(b + i));

        eq1 = _mm_or_si128(eq1, _mm_xor_si128(xa1, xb1));
        xa1 = _mm_cmpeq_epi32(xa1, _mm_and_si128(xa1, xb1));

        __m128i xa2 = _mm_load_si128((__m128i const*)(a + i + 4));
        __m128i xb2 = _mm_load_si128((__m128i const*)(b + i + 4));

        eq2 = _mm_or_si128(eq2, _mm_xor_si128(xa2, xb2));
        xa2 = _mm_cmpeq_epi32(xa2, _mm_and_si128(xa2, xb2));

        if (_mm_movemask_epi8(_mm_packs_epi32(xa1, xa2)) != 0xFFFF)
            return false;
    }

    return _mm_movemask_epi8(_mm_or_si128(eq1, eq2)) != 0;
}

If you've got enough data and a low probability of failure within the first few SSE blocks, something like this should be at least somewhat faster than your SSE.

于 2013-07-03T15:54:19.317 回答
0

我似乎您的问题是内存带宽有界问题:渐近您需要大约 2 次操作来处理扫描的内存中的一对整数。没有足够的算术复杂性来利用 CPU SSE 指令的更多算术吞吐量。实际上,您的 CPU 通过大量时间等待数据传输。但是在您的情况下使用 SSE 指令会引发整体指令,并且编译器无法很好地优化生成的代码。

有一些替代策略可以提高带宽有界问题的性能:

  • 多线程通过超线程上下文中的并发算术运算隐藏访问内存。
  • 每次微调数据负载的大小可以提高内存带宽。
  • 通过在循环中添加补充独立操作来提高管道连续性(在“for”循环中的每个步骤扫描两组不同的数据)
  • 在缓存或寄存器中保留更多数据(代码的某些迭代可能需要多次使用同一组数据)
于 2013-12-29T08:57:03.437 回答