opencl - OpenCL question: how to convert intrinsics into plain C?

Question

In OpenCL, the code is written like this:

void unpack_8bit_to_16bit( const __m128i a, __m128i& b0, __m128i& b1 ) 
{
      __m128i zero = _mm_setzero_si128();
      b0 = _mm_unpacklo_epi8( a, zero );
      b1 = _mm_unpackhi_epi8( a, zero );
}

Now I want to convert this code to C. Is that possible?

score 2 · Accepted Answer

As noted in the comments, this is not OpenCL code; it is C++ using SSE2 intrinsics. However, if you meant to ask how to convert this code to OpenCL, the approach to vectorization there is to use vector types, such as float4 (four 32-bit floats), double3 (three 64-bit doubles), long8 (eight 64-bit integers), and so on. The specification even reserves names for more exotic types like quad (128-bit float) and complex doubles, although as far as I know these are reserved for future use rather than usable today.

In your case, what you essentially want is to unpack a group of bytes into 16-bit words, handling the low and high halves of the input separately. You can do this either by swizzling or by explicitly computing each vector, but there is also another way to do this particular computation: OpenCL has a vector splitting mechanism, which splits a vector type into its lower and upper halves. It looks like this:

float4 input = (float4)(4.3f, 0.71f, 9.1f, 44.8f);
float2 inputLo = input.lo; // = (4.3, 0.71)
float2 inputHi = input.hi; // = (9.1, 44.8)

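Clearly this applies to your problem: all you have to do is split your uchar16 (sixteen 8-bit bytes) into its lower and upper uchar8 halves, and widen each uchar8 to a ushort8 (since you are unpacking). Note that a plain cast between vector types with different element types is not allowed in OpenCL C, so use the explicit conversion functions, for example convert_ushort8(input.lo). The unsigned types match the SSE code, which zero-extends each byte.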

Note that this is kind of a weird problem for OpenCL - this unpacking mechanism arises from the way data must be packed into SSE registers, so you constantly have to shuffle bytes around if you want to switch from 8-bit elements to 16-bit. In OpenCL this is unnecessary as you have vector types which don't assume a particular data arrangement (and you can readily convert from one type to another). If your OpenCL kernel happens to be executed on an SSE-capable processor, the kernel compiler will automatically do the packing and unpacking for you - hopefully optimally, if your code is sane.

You can't use intrinsics in OpenCL because kernels don't run exclusively on x86 and x64 hardware - they also run on GPUs, FPGAs, and custom chips. So instead, you use generic vector types which are automatically translated to the proper SIMD instructions on the platform on which the kernel is compiled (actually, it's a bit more complex, but that's the gist of it).

In view of your latest comment, I will add this: if you wish to convert the intrinsics into simple C code, all that is needed is an understanding of how data is packed into SSE registers. In basic terms: each SSE register is 128 bits wide, and can therefore hold either 16 bytes, 8 words (16-bit), 4 longs (32-bit), and so on. You cannot mix these types, so you can't have, for instance, 2 bytes and 7 words in one register. Each intrinsic assumes a particular element type: you might want the square root of each 64-bit double in the register, or the square root of each 32-bit float, and clearly which type you choose matters.
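To make the "one register, several interpretations" idea concrete, here is a minimal sketch in plain C. The names vec128, add_u8 and add_u16 are hypothetical and only for illustration; the two functions loosely mirror what _mm_add_epi8 and _mm_add_epi16 do with the same 128 bits.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 128-bit SSE register: the same 16 bytes
   can be viewed as 16 bytes, 8 words, or 4 32-bit values. */
typedef union {
    uint8_t  u8[16];
    uint16_t u16[8];
    uint32_t u32[4];
} vec128;

/* Lane-wise add treating the register as 16 bytes (each lane wraps modulo 256). */
vec128 add_u8(vec128 a, vec128 b)
{
    vec128 r;
    for (int i = 0; i < 16; ++i)
        r.u8[i] = (uint8_t)(a.u8[i] + b.u8[i]);
    return r;
}

/* Lane-wise add treating the register as 8 words (each lane wraps modulo 65536). */
vec128 add_u16(vec128 a, vec128 b)
{
    vec128 r;
    for (int i = 0; i < 8; ++i)
        r.u16[i] = (uint16_t)(a.u16[i] + b.u16[i]);
    return r;
}
```

Feeding the same two bit patterns to both functions generally gives different results, because a carry either stops at a byte boundary or continues into the next byte of the word. That is why every intrinsic has to commit to one element type.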

These elements are always contiguous. Say you want to convert an 8-word vector into two 4-long vectors, i.e. "unpack" it so that you can do 32-bit calculations on it. This means you want to go from:

[16-bit][16-bit][16-bit][16-bit][16-bit][16-bit][16-bit][16-bit]

to

[32-bit][32-bit][32-bit][32-bit] & [32-bit][32-bit][32-bit][32-bit]

Clearly you cannot just reinterpret the register, because two adjacent 16-bit words would get merged into a single 32-bit value, which produces garbage. Instead, you have to methodically pull each 16-bit word out, widen it to a 32-bit long, and put it into the new register - SSE does all this in hardware (the intrinsic maps to the appropriate instruction).
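A rough sketch of that difference in plain C, using arrays in place of registers. The function names merged_pair and widen_u16_to_u32 are made up for this example, and the widening is assumed to be unsigned (zero extension).

```c
#include <stdint.h>
#include <string.h>

/* What NOT to do: read two adjacent 16-bit words as one 32-bit value.
   The result mixes both words together (and depends on byte order). */
uint32_t merged_pair(const uint16_t in[8], int pair_index)
{
    uint32_t v;
    memcpy(&v, &in[2 * pair_index], sizeof v);
    return v;
}

/* What the unpacking does conceptually: pull each 16-bit word out,
   widen it to 32 bits, and store it in one of two output vectors.
   lo receives elements 0..3, hi receives elements 4..7. */
void widen_u16_to_u32(const uint16_t in[8], uint32_t lo[4], uint32_t hi[4])
{
    for (int i = 0; i < 4; ++i) {
        lo[i] = (uint32_t)in[i];
        hi[i] = (uint32_t)in[i + 4];
    }
}
```

With an input of 1, 2, ..., 8, widen_u16_to_u32 yields lo = 1, 2, 3, 4 and hi = 5, 6, 7, 8, whereas merged_pair returns a value built from two neighbouring words at once.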

In your particular case, you have a register containing 16 bytes, and you want to "output" the data into two other registers, each of which will contain 8 words. Interleaving with a zero register, as your code does, zero-extends each byte. So if your input register contains a0..a15 (those are bytes), then you will have:

b0 = (word)a0..(word)a7
b1 = (word)a8..(word)a15

You can do this in C using arrays to "simulate" an SSE register (you can do it the fancy way with a union containing each possible vector that fits in a register, or just hardcode different array types and convert from one to another).
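As a minimal sketch of the array approach, assuming the bytes are meant to be unsigned (which matches interleaving with zero in the original code), the function from the question could look like this in plain C. The array parameters are my stand-ins for the __m128i arguments, not anything the original code defines.

```c
#include <stdint.h>

/* Plain C counterpart of the SSE function in the question.
   a  : 16 input bytes (stand-in for the __m128i argument)
   b0 : receives bytes 0..7, each zero-extended to 16 bits
   b1 : receives bytes 8..15, each zero-extended to 16 bits */
void unpack_8bit_to_16bit(const uint8_t a[16], uint16_t b0[8], uint16_t b1[8])
{
    for (int i = 0; i < 8; ++i) {
        b0[i] = a[i];      /* low half of the input */
        b1[i] = a[i + 8];  /* high half of the input */
    }
}
```

If the bytes should be treated as signed instead, you would use int8_t and int16_t, but that is no longer what the SSE version with a zero register computes.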

For reference, see this, which explains it a bit (I also recommend you read up on how SSE registers work, because this is the reason packing exists and why it matters).
