12

I'm trying to find a way to perform an indirect shift-left/right operation without actually using the variable shift op or any branches.

The particular PowerPC processor I'm working on has the quirk that a shift-by-constant-immediate, like

int ShiftByConstant( int x ) { return x << 3 ; } 

is fast, single-op, and superscalar, whereas a shift-by-variable, like

int ShiftByVar( int x, int y ) { return x << y ; }

is a microcoded operation that takes 7-11 cycles to execute while the entire rest of the pipeline stalls.

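What I'd like to do is figure out which non-microcoded integer PPC ops sraw decodes into and then issue them individually. This won't help with the latency of sraw itself (it will replace one op with six), but in between those six ops I can dual-dispatch some work to the other execution units and get a net gain.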

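I can't seem to find anywhere which μops sraw decodes into. Does anyone know how I can replace a variable bit shift with a sequence of constant shifts and basic integer operations? (A for loop, a switch, or anything with a branch in it won't work, because the branch penalty is even bigger than the microcode penalty, even for correctly predicted branches.)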

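This needn't be answered in assembly; I'm hoping to learn the algorithm rather than the particular code, so an answer in C, a high-level language, or even pseudocode would be perfectly helpful.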

Edit: A few clarifications I should add:

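  1. I'm not even a little bit worried about portability
  2. PPC has a conditional move, so we can assume the existence of a branchless intrinsic function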

    int isel(a, b, c)  { return a >= 0 ? b : c; }
    

    (if you write out a ternary that does the same thing, I'll get what you mean)

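  3. Integer multiply is also microcoded, and even slower than sraw. :-(
  4. On Xenon PPC, the latency of a predicted branch is 8 cycles, so even one branch makes it as costly as the microcoded instruction. Jump-to-pointer (any indirect branch or function pointer) is a guaranteed mispredict, a 24-cycle stall.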

8 Answers

8

Here you go...

I decided to try these out as well since Mike Acton claimed it would be faster than using the CELL/PS3 microcoded shift on his CellPerformance site where he suggests to avoid the indirect shift. However, in all my tests, using the microcoded version was not only faster than a full generic branch-free replacement for indirect shift, it takes way less memory for the code (1 instruction).

The only reason I did these as templates was to get the right output for both signed (usually arithmetic) and unsigned (logical) shifts.

template <typename T> FORCEINLINE T VariableShiftLeft(T nVal, int nShift)
{   // 31-bit shift capability (Rolls over at 32-bits)
    const int bMask1=-(1&nShift);
    const int bMask2=-(1&(nShift>>1));
    const int bMask3=-(1&(nShift>>2));
    const int bMask4=-(1&(nShift>>3));
    const int bMask5=-(1&(nShift>>4));
    nVal=(nVal&bMask1) + nVal;   //nVal=((nVal<<1)&bMask1) | (nVal&(~bMask1));
    nVal=((nVal<<(1<<1))&bMask2) | (nVal&(~bMask2));
    nVal=((nVal<<(1<<2))&bMask3) | (nVal&(~bMask3));
    nVal=((nVal<<(1<<3))&bMask4) | (nVal&(~bMask4));
    nVal=((nVal<<(1<<4))&bMask5) | (nVal&(~bMask5));
    return(nVal);
}
template <typename T> FORCEINLINE T VariableShiftRight(T nVal, int nShift)
{   // 31-bit shift capability (Rolls over at 32-bits)
    const int bMask1=-(1&nShift);
    const int bMask2=-(1&(nShift>>1));
    const int bMask3=-(1&(nShift>>2));
    const int bMask4=-(1&(nShift>>3));
    const int bMask5=-(1&(nShift>>4));
    nVal=((nVal>>1)&bMask1) | (nVal&(~bMask1));
    nVal=((nVal>>(1<<1))&bMask2) | (nVal&(~bMask2));
    nVal=((nVal>>(1<<2))&bMask3) | (nVal&(~bMask3));
    nVal=((nVal>>(1<<3))&bMask4) | (nVal&(~bMask4));
    nVal=((nVal>>(1<<4))&bMask5) | (nVal&(~bMask5));
    return(nVal);
}
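
As a self-contained illustration of the same mask-and-select idea, here is a minimal sketch restricted to `uint32_t`. The names `ShiftLeftIf`, `ShiftRightIf`, `ShiftLeftNoVar`, `ShiftRightNoVar`, and `CheckAgainstNativeShift` are hypothetical and not from the answer above; the shift amounts are template parameters so that every shift in the source is a shift by a constant.

```cpp
#include <cassert>
#include <cstdint>

// Applies a constant left shift by Amount when mask is all ones,
// and returns v unchanged when mask is 0. No branch is involved.
template <int Amount>
static inline uint32_t ShiftLeftIf(uint32_t v, uint32_t mask)
{
    return ((v << Amount) & mask) | (v & ~mask);
}

// Same selection for a logical right shift.
template <int Amount>
static inline uint32_t ShiftRightIf(uint32_t v, uint32_t mask)
{
    return ((v >> Amount) & mask) | (v & ~mask);
}

// Left shift for counts 0..31: each bit of count enables one constant shift.
uint32_t ShiftLeftNoVar(uint32_t v, uint32_t count)
{
    v = ShiftLeftIf<1>(v, 0u - (count & 1u));         // 0u - 1u == 0xFFFFFFFF
    v = ShiftLeftIf<2>(v, 0u - ((count >> 1) & 1u));
    v = ShiftLeftIf<4>(v, 0u - ((count >> 2) & 1u));
    v = ShiftLeftIf<8>(v, 0u - ((count >> 3) & 1u));
    v = ShiftLeftIf<16>(v, 0u - ((count >> 4) & 1u));
    return v;
}

// Logical right shift for counts 0..31, built the same way.
uint32_t ShiftRightNoVar(uint32_t v, uint32_t count)
{
    v = ShiftRightIf<1>(v, 0u - (count & 1u));
    v = ShiftRightIf<2>(v, 0u - ((count >> 1) & 1u));
    v = ShiftRightIf<4>(v, 0u - ((count >> 2) & 1u));
    v = ShiftRightIf<8>(v, 0u - ((count >> 3) & 1u));
    v = ShiftRightIf<16>(v, 0u - ((count >> 4) & 1u));
    return v;
}

// Compares both functions with the native shift for every count 0..31.
void CheckAgainstNativeShift(uint32_t v)
{
    for (uint32_t c = 0; c < 32; ++c) {
        assert(ShiftLeftNoVar(v, c) == (v << c));
        assert(ShiftRightNoVar(v, c) == (v >> c));
    }
}
```

Calling `CheckAgainstNativeShift(0x80000001u)` exercises both directions. Whether this sequence beats the microcoded shift depends on how much independent work can be interleaved with it, which is exactly what the measurements above call into question.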

EDIT: Note on isel(): I saw your isel() code on your website.

// if a >= 0, return x, else y
int isel( int a, int x, int y )
{
    int mask = a >> 31; // arithmetic shift right, splat out the sign bit
    // mask is 0xFFFFFFFF if (a < 0) and 0x00 otherwise.
    return x + ((y - x) & mask);
}

FWIW, if you rewrite your isel() to do a mask and mask complement, it will be faster on your PowerPC target since the compiler is smart enough to generate an 'andc' opcode. It's the same number of opcodes but there is one fewer result-to-input-register dependency in the opcodes. The two mask operations can also be issued in parallel on a superscalar processor. It can be 2-3 cycles faster if everything is lined up correctly. You just need to change the return to this for the PowerPC versions:

return (x & (~mask)) + (y & mask);
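
For reference, the complete function with that return statement might look like the sketch below. The name `IselMasked` is hypothetical, and like the original isel() it assumes that `>>` on a negative `int32_t` is an arithmetic shift, which holds on the usual compilers but was implementation-defined before C++20.

```cpp
#include <cassert>
#include <cstdint>

// if a >= 0, return x, else y
int32_t IselMasked(int32_t a, int32_t x, int32_t y)
{
    // Assumption: arithmetic right shift, so mask is -1 when a < 0 and 0 otherwise.
    const int32_t mask = a >> 31;
    // The two AND terms do not depend on each other, so they can issue in parallel.
    // Exactly one of them is nonzero at most, so the sum cannot overflow.
    return (x & ~mask) + (y & mask);
}

// Small usage check of the selection rule.
void CheckIselMasked()
{
    assert(IselMasked(0, 10, 20) == 10);
    assert(IselMasked(5, 10, 20) == 10);
    assert(IselMasked(-1, 10, 20) == 20);
}
```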
answered 2009-10-21T21:35:16.593
5

How about this:

if (y & 16) x <<= 16;
if (y & 8) x <<= 8;
if (y & 4) x <<= 4;
if (y & 2) x <<= 2;
if (y & 1) x <<= 1;

It will probably take longer to execute, but it is easier to interleave if you have other code to put in between.
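
Since the question rules out branches, the `if` statements can be read as selects. The sketch below is one possible rewrite, not part of the answer: `isel` is a plain C++ stand-in for the PPC intrinsic described in the question (written as a ternary, which a compiler may or may not turn into a conditional move), and `ShiftLeftLadder` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the isel intrinsic from the question: returns x when a >= 0, else y.
static inline uint32_t isel(int32_t a, uint32_t x, uint32_t y)
{
    return a >= 0 ? x : y;
}

// Left shift for counts 0..31. -(count & bit) is negative exactly when that bit
// of count is set, so isel then picks the value shifted by a constant.
uint32_t ShiftLeftLadder(uint32_t x, int32_t count)
{
    x = isel(-(count & 16), x, x << 16);
    x = isel(-(count & 8),  x, x << 8);
    x = isel(-(count & 4),  x, x << 4);
    x = isel(-(count & 2),  x, x << 2);
    x = isel(-(count & 1),  x, x << 1);
    return x;
}

// Compares the ladder with the native shift for every count 0..31.
void CheckShiftLeftLadder(uint32_t v)
{
    for (int32_t c = 0; c < 32; ++c)
        assert(ShiftLeftLadder(v, c) == (v << c));
}
```

The five steps form a dependency chain, so any gain has to come from unrelated work scheduled between them.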

answered 2009-02-12T03:19:10.577
4
answered 2009-02-12T04:06:23.350
1

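This one just broke my head. I've now discarded half a dozen ideas. All of them exploit the notion that adding a thing to itself shifts it left by 1, doing the same to the result gives a shift left by 2 (a multiplication by 4), and so on. If you keep all the partial results for shift left 0, 1, 2, 4, 8, and 16, then by testing bits 0 to 4 of the shift variable you can get your initial shift. Now do it again, once for each 1 bit in the shift variable. Frankly, you might as well send your processor out for coffee.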

The one place I'd look for real help is Hank Warren's Hacker's Delight (which is the only useful part of this answer).
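
One way to read the doubling idea, as a rough sketch only: add a masked copy of the value to itself, so the value either doubles or stays put, and repeat that 1, 2, 4, 8, and 16 times for the five bits of the shift count. `DoubleIf` and `ShiftLeftByAdds` are hypothetical names, and the inner loop has a constant trip count on the assumption that the compiler unrolls it fully.

```cpp
#include <cassert>
#include <cstdint>

// Doubles v Times times when mask is all ones; leaves it unchanged when mask is 0.
// v + (v & mask) is either v + v (a shift left by 1) or v + 0.
template <int Times>
static inline uint32_t DoubleIf(uint32_t v, uint32_t mask)
{
    for (int i = 0; i < Times; ++i)   // constant trip count, meant to be unrolled
        v += v & mask;
    return v;
}

// Left shift for counts 0..31 using only adds, ANDs, and constant shifts of count.
uint32_t ShiftLeftByAdds(uint32_t v, uint32_t count)
{
    v = DoubleIf<1>(v,  0u - (count & 1u));
    v = DoubleIf<2>(v,  0u - ((count >> 1) & 1u));
    v = DoubleIf<4>(v,  0u - ((count >> 2) & 1u));
    v = DoubleIf<8>(v,  0u - ((count >> 3) & 1u));
    v = DoubleIf<16>(v, 0u - ((count >> 4) & 1u));
    return v;
}

// Compares the add-based shift with the native shift for every count 0..31.
void CheckShiftLeftByAdds(uint32_t v)
{
    for (uint32_t c = 0; c < 32; ++c)
        assert(ShiftLeftByAdds(v, c) == (v << c));
}
```

This always executes 31 dependent add/AND pairs regardless of the count, which illustrates why the approach is not competitive.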

answered 2009-02-12T03:27:51.087
0

How about this:

int[] multiplicands = { 1, 2, 4, 8, 16, 32, ... etc ...};

int ShiftByVar( int x, int y )
{
    //return x << y;
    return x * multiplicands[y];
}
answered 2009-02-12T03:33:56.187
0

If the shift count can be calculated far in advance then I have two ideas that might work

  • Using self-modifying code

    Just modify the shift amount immediate in the instruction. Alternatively generate code dynamically for the functions with variable shift

  • Group the values with the same shift count together if possible, and do the operation all at once using Duff's device or function pointer to minimize branch misprediction

    // shift by constant functions
    typedef int (*shiftFunc)(int);    // the shift function
    #define SHL(n) int shl##n(int x) { return x << (n); }
    SHL(1)
    SHL(2)
    SHL(3)
    ...
    shiftFunc shiftLeft[] = { shl1, shl2, shl3... };
    
    int arr[MAX];       // all the values that need to be shifted with the same amount
    shiftFunc shl = shiftLeft[3 - 1]; // when you want to shift by 3 (shiftLeft[0] is shl1)
    for (int i = 0; i < MAX; i++)
        arr[i] = shl(arr[i]);
    

    This method might also be done in combination with self-modifying or run-time code generation to remove the need for a function pointer.

    Edit: As commented, unfortunately there's no branch prediction on jump to register at all, so the only way this could work is generating code as I said above, or using SIMD
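
    A variant of the grouping idea that avoids the function pointer is to pay for one `switch` per batch and run a loop whose shift amount is a compile-time constant. This is only a sketch under assumptions: `ShiftAllLeft` and `ShiftBatchLeft` are hypothetical names, and only counts 0 to 7 are handled to keep it short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Shifts every element by N, where N is a compile-time constant,
// so the loop body uses the shift-by-immediate form.
template <int N>
static void ShiftAllLeft(uint32_t* values, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
        values[i] <<= N;
}

// Dispatches once per batch. Returns false for counts this sketch does not handle.
bool ShiftBatchLeft(uint32_t* values, std::size_t count, int shift)
{
    switch (shift) {
    case 0: return true;
    case 1: ShiftAllLeft<1>(values, count); return true;
    case 2: ShiftAllLeft<2>(values, count); return true;
    case 3: ShiftAllLeft<3>(values, count); return true;
    case 4: ShiftAllLeft<4>(values, count); return true;
    case 5: ShiftAllLeft<5>(values, count); return true;
    case 6: ShiftAllLeft<6>(values, count); return true;
    case 7: ShiftAllLeft<7>(values, count); return true;
    default: return false;
    }
}

// Usage: shift a small batch by 3 and verify the result.
void CheckShiftBatchLeft()
{
    uint32_t data[3] = {1u, 2u, 3u};
    assert(ShiftBatchLeft(data, 3, 3));
    assert(data[0] == 8u && data[1] == 16u && data[2] == 24u);
}
```

    The dispatch branch is paid once per batch rather than once per value, so it only helps when many values share a shift count.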


If the range of the values is small, a lookup table is another possible solution:

#define S(x, n) ((x) + 0) << (n), ((x) + 1) << (n), ((x) + 2) << (n), ((x) + 3) << (n), \
                ((x) + 4) << (n), ((x) + 5) << (n), ((x) + 6) << (n), ((x) + 7) << (n)
#define S2(x, n)    S((x + 0)*8, n), S((x + 1)*8, n), S((x + 2)*8, n), S((x + 3)*8, n), \
                    S((x + 4)*8, n), S((x + 5)*8, n), S((x + 6)*8, n), S((x + 7)*8, n)
// One row per shift count n, one column per 8-bit value x; results are truncated to 8 bits
uint8_t shl[8][256] = {
    { S2(0U, 0), S2(8U, 0), S2(16U, 0), S2(24U, 0) },
    { S2(0U, 1), S2(8U, 1), S2(16U, 1), S2(24U, 1) },
    ...
    { S2(0U, 7), S2(8U, 7), S2(16U, 7), S2(24U, 7) },
};

Now x << n is simply shl[n][x] with x being a uint8_t. The table costs 2KB (8 × 256 B) of memory. However, for 16-bit values you'll need a 2MB table (16 × 64K entries × 2 B), which may still be viable, and you can do a 32-bit shift by combining two 16-bit shifts together.
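
If writing the initializer by hand is inconvenient, the 8-bit table can also be filled once at startup. The sketch below assumes that a one-time variable shift during initialization is acceptable; `g_shl`, `BuildShiftTable`, and `ShiftLeft8` are hypothetical names, and the table is indexed by shift count first, then by value.

```cpp
#include <cassert>
#include <cstdint>

// g_shl[n][x] holds the low 8 bits of x << n, for n in 0..7 and x in 0..255.
static uint8_t g_shl[8][256];

// Fills the table once; the variable shifts here run only during setup.
void BuildShiftTable()
{
    for (int n = 0; n < 8; ++n)
        for (int x = 0; x < 256; ++x)
            g_shl[n][x] = static_cast<uint8_t>(x << n);
}

// Table-driven 8-bit left shift; n is masked to 0..7 to stay inside the table.
inline uint8_t ShiftLeft8(uint8_t x, unsigned n)
{
    return g_shl[n & 7u][x];
}

// Usage: build the table, then look up a couple of shifts.
void CheckShiftTable()
{
    BuildShiftTable();
    assert(ShiftLeft8(3, 2) == 12);
    assert(ShiftLeft8(0x81, 1) == 0x02);   // the top bit is shifted out
}
```

Each lookup trades the microcoded shift for a load, so the benefit depends on the table staying in cache.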

answered 2019-01-25T12:01:51.687
-1

There is some good stuff here regarding bit manipulation black magic: Advanced bit manipulation fu (Christer Ericson's blog)

Don't know if any of it's directly applicable, but if there is a way, likely there are some hints to that way in there somewhere.

answered 2009-02-12T04:27:19.337
-1

Here's something that is trivially unrollable:

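// value is the number to shift, k is the shift count (0..31)
int result= value;

for (int i= 0; i<5; ++i)
{
    int mask= -(k & 1); // all ones if this bit of k is set; replace with isel if appropriate
    // shift by 2^i (a constant once the loop is unrolled) only when the bit is set
    result= ((result << (1 << i)) & mask) | (result & ~mask);
    k >>= 1;
}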
answered 2009-08-24T17:34:43.943