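While initially investigating the effect of the #pragma omp simd directive, I came across behavior I cannot explain, related to the vectorization of a simple for loop. The following code sample can be tested on this awesome Compiler Explorer, provided the -O3 flag is applied and the target is the x86 architecture.

Can anybody explain the logic behind the following observations?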

#include <stdint.h> 

void test(uint8_t* out, uint8_t const* in, uint32_t length)
{
    unsigned const l1 = (length * 32)/32;  // This is vectorized
    unsigned const l2 = (length / 32)*32;  // This is not vectorized

    unsigned const l3 = (length << 5)>>5;  // This is vectorized
    unsigned const l4 = (length >> 5)<<5;  // This is not vectorized

    unsigned const l5 = length -length%32; // This is not vectorized
    unsigned const l6 = length & ~(32 -1); // This is not vectorized

    for (unsigned i = 0; i<l1 /*pick your choice*/; ++i)
    {
      out[i] = in[i*2];
    }
}

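What puzzles me is that both l1 and l3 generate vectorized code even though they are not guaranteed to be multiples of 32. None of the other lengths produce vectorized code, even though they are guaranteed to be multiples of 32. Is there a reason behind this?

As an aside, using the #pragma omp simd directive doesn't actually change anything.

Edit: After further investigation, the difference in behavior disappears when the index type is size_t (no boundary manipulation is even needed), meaning that this generates vectorized code: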
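To make the claim about multiples of 32 concrete, here is a minimal sketch that evaluates all six bound expressions for a given length. The helper name loop_bounds and the output array are hypothetical and not part of the original code, and uint32_t is assumed to match the 32-bit unsigned used on x86.

```c
#include <stdint.h>

/* Hypothetical helper (not part of the original code): stores l1..l6
   for a given length in b[0]..b[5], using 32-bit unsigned arithmetic. */
void loop_bounds(uint32_t length, uint32_t b[6])
{
    b[0] = (length * 32u) / 32u;   /* l1: clears the top 5 bits of length */
    b[1] = (length / 32u) * 32u;   /* l2: rounds down to a multiple of 32 */
    b[2] = (length << 5) >> 5;     /* l3: same value as l1 */
    b[3] = (length >> 5) << 5;     /* l4: same value as l2 */
    b[4] = length - length % 32u;  /* l5: same value as l2 */
    b[5] = length & ~(32u - 1u);   /* l6: same value as l2 */
}
```

For length = 100, l1 and l3 stay at 100 while l2, l4, l5 and l6 all become 96. For length = 0xFFFFFFFF, l1 and l3 become 0x07FFFFFF, so the only thing l1 and l3 guarantee is that the bound is below 2^27.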

#include <stdint.h>
#include <stddef.h>  // for size_t

void test(uint8_t* out, uint8_t const* in, size_t length)
{
    for (size_t i = 0; i<length; ++i)
    {
        out[i] = in[i*2];
    }
}

If anyone knows why loop vectorization depends so heavily on the index type, I'd love to know more!

Edit 2: Thanks to Mark Lakata, -O3 is actually needed.

2 Answers

The issue is the apparent conversion [1] from unsigned to size_t in the array index: in[i*2];

If you use l1 or l3, then i stays below 2^27, so the computation of i*2 can never wrap around in 32-bit unsigned arithmetic. This means that the type unsigned practically behaves as if it were size_t.

But when you use the other options, i can get close to 2^32, so i*2 might wrap around in unsigned before it is converted to size_t. The compiler has to preserve that wrapping, so the conversion must actually be performed on every iteration.

If you take your first example, without choosing l1 or l3, and cast i before the multiplication:

out[i] = in[( size_t )i*2];

the compiler vectorizes the loop. If you instead cast the whole expression:

out[i] = in[( size_t )(i*2)];

it doesn't.
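The difference between the two casts can be checked with a small sketch. The function names index_wrapping and index_widened are hypothetical, and a 64-bit size_t with a 32-bit unsigned (as on x86-64) is assumed.

```c
#include <stddef.h>
#include <stdint.h>

/* Index as in in[(size_t)(i*2)]: multiply in 32 bits (may wrap), then widen. */
size_t index_wrapping(uint32_t i)
{
    return (size_t)(i * 2u);
}

/* Index as in in[(size_t)i*2]: widen first, then multiply (no wrap
   when size_t is 64 bits wide). */
size_t index_widened(uint32_t i)
{
    return (size_t)i * 2u;
}
```

For i = 0x80000001 the first function returns 2, while the second returns 0x100000002 under the 64-bit assumption. For any i below 2^27, which is all that l1 and l3 allow, both functions agree, which is why the compiler can treat the index as a plain linear sequence in that case.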


[1] The Standard doesn't actually specify that the type of the index must be size_t, but it is a logical step from the compiler's perspective.

Answered 2016-07-15T14:09:11.657

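I believe you are confusing optimization with vectorization. I used your Compiler Explorer link and set -O2 for x86, and none of the examples are "vectorized".

Here is l1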

test(unsigned char*, unsigned char const*, unsigned int):
        xorl    %eax, %eax
        andl    $134217727, %edx
        je      .L1
.L5:
        movzbl  (%rsi,%rax,2), %ecx
        movb    %cl, (%rdi,%rax)
        addq    $1, %rax
        cmpl    %eax, %edx
        ja      .L5
.L1:
        rep ret

Here is l2

test(unsigned char*, unsigned char const*, unsigned int):
        andl    $-32, %edx
        je      .L1
        leal    -1(%rdx), %eax
        leaq    1(%rdi,%rax), %rcx
        xorl    %eax, %eax
.L4:
        movl    %eax, %edx
        addq    $1, %rdi
        addl    $2, %eax
        movzbl  (%rsi,%rdx), %edx
        movb    %dl, -1(%rdi)
        cmpq    %rcx, %rdi
        jne     .L4
.L1:
        rep ret

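This isn't surprising, because what you are doing is essentially a "gather" load operation, where the load indices differ from the store indices. Baseline x86 has no support for gather/scatter. Gather was only introduced with AVX2 (and scatter with AVX512), and neither is selected here.

The slightly longer code deals with the signed/unsigned issue, but no vectorization is going on.

Answered 2016-07-15T23:05:14.053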
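To illustrate what "gather" means here, the following is a scalar sketch only, not vector code. The names gather_u8 and strided_copy are hypothetical. It shows that the loop from the question behaves like a gather whose index list happens to be 0, 2, 4, and so on.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar "gather": each output element is loaded from an
   arbitrary position given by idx[]. */
void gather_u8(uint8_t *out, const uint8_t *in, const uint32_t *idx, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = in[idx[i]];
}

/* The loop from the question, viewed as a gather with idx[i] == 2*i. */
void strided_copy(uint8_t *out, const uint8_t *in, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = in[i * 2];
}
```

Calling gather_u8 with idx = {0, 2, 4, 6} produces the same output as strided_copy with n = 4. Because the indices in the question follow a fixed stride, a compiler may also be able to handle them with ordinary vector loads plus shuffles rather than a true gather instruction, so this sketch is only meant to clarify the terminology.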