In my code, the following lines are currently the hot spot:
    int table1[256] = /*...*/;
    int table2[512] = /*...*/;
    int table3[512] = /*...*/;
    int* result = /*...*/;
    for(int r = 0; r < r_end; ++r)
    {
        std::uint64_t bits = bit_reader.value(); // 64 bits, no assumption regarding bits.
        // The get_ functions are table lookups from the highest word of the bits variable.
        struct entry
        {
            int sign_offset : 5;
            int r_offset : 4;
            int x : 7;
        };
        // NOTE: We are only interested in the highest word in the bits variable.
        entry e;
        if(is_in_table1(bits)) // branch prediction should work well here since table1 will be hit more often than 2 or 3, and 2 more often than 3.
            e = reinterpret_cast<const entry&>(table1[get_table1_index(bits)]);
        else if(is_in_table2(bits))
            e = reinterpret_cast<const entry&>(table2[get_table2_index(bits)]);
        else
            e = reinterpret_cast<const entry&>(table3[get_table3_index(bits)]);
        r += e.r_offset; // r is 18 bits, top 14 bits are always 0.
        int x = e.x; // x is 14 bits, top 18 bits are always 0.
        int sign_offset = e.sign_offset;
        assert(sign_offset <= 16 && sign_offset > 0);
        // The following is the hotspot.
        int sign = 1 - (bits >> (63 - sign_offset) & 0x2);
        (*result++) = ((x << 18) * sign) | r; // 32 bits
        // End of hotspot
        bit_reader.skip(sign_offset); // sign_offset is the last bit used.
    }
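I have not yet figured out how to optimize this further, but perhaps something from the intrinsics for Operations at Bit-Granularity, such as __shiftleft128 or _rot, could be useful?
Note that I also process the resulting data on the GPU, so the important thing is to get something into result that the GPU can then use to calculate the correct value.
Suggestions?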
EDIT:
Added the table look-up.
EDIT:
    int sign = 1 - (bits >> (63 - e.sign_offset) & 0x2);
    000000013FD6B893  and   ecx,1Fh
    000000013FD6B896  mov   eax,3Fh
    000000013FD6B89B  sub   eax,ecx
    000000013FD6B89D  movzx ecx,al
    000000013FD6B8A0  shr   r8,cl
    000000013FD6B8A3  and   r8d,2
    000000013FD6B8A7  mov   r14d,1
    000000013FD6B8AD  sub   r14d,r8d
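For anyone who wants to experiment with the hot spot in isolation, here is a minimal sketch of the sign and pack step as standalone functions. The names pack_reference and pack_branchless are hypothetical and not part of my code. The sketch assumes x fits in 14 bits, r fits in 18 bits, and sign_offset is in 1..16, and it uses unsigned 32-bit arithmetic so that the wraparound on negation is well defined. The second function replaces the multiply by sign with an XOR and an add on the extracted sign bit. It should produce the same value, but I have not measured whether it is actually faster.

```cpp
#include <cassert>
#include <cstdint>

// Reference: the expression from the loop above, written with unsigned
// arithmetic (an assumption of this sketch) so that wraparound is defined.
inline std::uint32_t pack_reference(std::uint64_t bits, std::uint32_t x,
                                    std::uint32_t r, int sign_offset)
{
    assert(sign_offset > 0 && sign_offset <= 16);
    // (bits >> (63 - sign_offset)) & 2 is either 0 or 2, so sign is +1 or -1.
    const int sign = 1 - static_cast<int>((bits >> (63 - sign_offset)) & 0x2);
    return ((x << 18) * static_cast<std::uint32_t>(sign)) | r;
}

// Hypothetical multiply-free variant: conditional two's complement negation.
inline std::uint32_t pack_branchless(std::uint64_t bits, std::uint32_t x,
                                     std::uint32_t r, int sign_offset)
{
    assert(sign_offset > 0 && sign_offset <= 16);
    // Same bit as above, moved down to bit 0: 1 means negative.
    const std::uint32_t neg =
        static_cast<std::uint32_t>((bits >> (64 - sign_offset)) & 1u);
    const std::uint32_t mask = 0u - neg;       // all ones when negative, else 0
    const std::uint32_t magnitude = x << 18;   // x occupies the top 14 bits
    return ((magnitude ^ mask) + neg) | r;     // (v ^ -1) + 1 == -v
}
```

In the loop, the result line would then read something like (*result++) = pack_branchless(bits, x, r, sign_offset); with x and r converted to unsigned.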