c# - Pointers and bit operators in GPU kernels

Question

I want to perform a double threshold on a volume, using a GPU kernel. I send my volume, per slice, as read_only image2d_t. My output volume is a binary volume, where each bit specifies if its related voxel is enabled or disabled. My kernel checks if the current pixel value is within the lower/upper threshold range, and enables its corresponding bit in the binary volume.

For debugging purposes, I left the actual check commented for now. I simply use the passed slice nr to determine if the binary volume bit should be on or off. The first 14 slices are set to "on", the rest to "off". I have also verified this code on the CPU side, the code I pasted at the bottom of this post. The code shows both paths, the CPU being commented now.

The CPU code works as intended, the following image is returned after rendering the volume with the binary mask applied:

Rendering with a correct computed mask

Running the exact same logic using my GPU kernel returns incorrect results (1st 3D, 2nd slice view):

Rendering with an incorrect GPU computed mask

Rendering with an incorrect GPU computed mask (sliceview)

What goes wrong here? I read that OpenCL does not support bit fields, but it does support bitwise operators as far as I could understand from the OpenCL specs. My bit logic, which selects the right bit from the 32 bit word and flips it, is supported right? Or is my simple flag considered a bit field. What it does is select the voxel%32 bit from the left (not the right, hence the subtract).

Another thing could be that the uint pointer passed to my kernel is different from what I expect. I assumed this would be valid use of pointers and passing data to my kernel. The logic applied to the "uint* word" part in the kernel is due to padding words per row, and paddings rows per slice. The CPU variant confirmed that the pointer calculation logic is valid though.

Below; the code

            uint wordsPerRow = (uint)BinaryVolumeWordsPerRow(volume.Geometry.NumberOfVoxels);
            uint wordsPerPlane = (uint)BinaryVolumeWordsPerPlane(volume.Geometry.NumberOfVoxels);

            int[] dims = new int[3];
            dims[0] = volume.Geometry.NumberOfVoxels.X;
            dims[1] = volume.Geometry.NumberOfVoxels.Y;
            dims[2] = volume.Geometry.NumberOfVoxels.Z;

            uint[] arrC = dstVolume.BinaryData.ObtainArray() as uint[];
            unsafe {
                fixed(int* dimPtr = dims) {
                    fixed(uint *arrcPtr = arrC) {
                        // pick Cloo Platform
                        ComputePlatform platform = ComputePlatform.Platforms[0];

                        // create context with all gpu devices
                        ComputeContext context = new ComputeContext(ComputeDeviceTypes.Gpu,
                            new ComputeContextPropertyList(platform), null, IntPtr.Zero);

                        // load opencl source
                        StreamReader streamReader = new StreamReader(@"C:\views\pii-sw113v1\PMX\ADE\Philips\PmsMip\Private\Viewing\Base\BinaryVolumes\kernels\kernel.cl");
                        string clSource = streamReader.ReadToEnd();
                        streamReader.Close();

                        // create program with opencl source
                        ComputeProgram program = new ComputeProgram(context, clSource);

                        // compile opencl source
                        program.Build(null, null, null, IntPtr.Zero);

                        // Create the event wait list. An event list is not really needed for this example but it is important to see how it works.
                        // Note that events (like everything else) consume OpenCL resources and creating a lot of them may slow down execution.
                        // For this reason their use should be avoided if possible.
                        ComputeEventList eventList = new ComputeEventList();

                        // Create the command queue. This is used to control kernel execution and manage read/write/copy operations.
                        ComputeCommandQueue commands = new ComputeCommandQueue(context, context.Devices[0], ComputeCommandQueueFlags.None);

                        // Create the kernel function and set its arguments.
                        ComputeKernel kernel = program.CreateKernel("LowerThreshold");

                        int slicenr = 0;
                        foreach (IntPtr ptr in pinnedSlices) {
                            /*// CPU VARIANT FOR TESTING PURPOSES 
                            for (int y = 0; y < dims[1]; y++) {
                                for (int x = 0; x < dims[0]; x++) {
                                    long pixelOffset = x + y * dims[0];
                                    ushort* ushortPtr = (ushort*)ptr;
                                    ushort pixel = *(ushortPtr + pixelOffset);

                                    int BinaryWordShift = 5;
                                    int BinaryWordBits = 32;
                                    if (
                                        (0 <= x) &&
                                        (0 <= y) &&
                                        (0 <= slicenr) &&
                                        (x < dims[0]) &&
                                        (y < dims[1]) &&
                                        (slicenr < dims[2])
                                    ) {
                                        uint* word =
                                            arrcPtr + 1 + (slicenr * wordsPerPlane) +
                                            (y * wordsPerRow) +
                                            (x >> BinaryWordShift);
                                        uint mask = (uint)(0x1 << ((BinaryWordBits - 1) - (byte)(x & 0x1f)));
                                        //if (pixel > lowerThreshold && pixel < upperThreshold) {
                                        if (slicenr < 15) {
                                            *word |= mask;
                                        } else {
                                            *word &= ~mask;
                                        }
                                    }
                                }
                            }*/

                            ComputeBuffer<int> dimsBuffer = new ComputeBuffer<int>(
                                context,
                                ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
                                3,
                                new IntPtr(dimPtr));

                            ComputeImageFormat format = new ComputeImageFormat(ComputeImageChannelOrder.Intensity, ComputeImageChannelType.UnsignedInt16);
                            ComputeImage2D image2D = new ComputeImage2D(
                                context, 
                                ComputeMemoryFlags.ReadOnly, 
                                format, 
                                volume.Geometry.NumberOfVoxels.X, 
                                volume.Geometry.NumberOfVoxels.Y, 
                                0, 
                                ptr
                            );

                            // The output buffer doesn't need any data from the host. Only its size is specified (arrC.Length).
                            ComputeBuffer<uint> c = new ComputeBuffer<uint>(
                                context, ComputeMemoryFlags.WriteOnly, arrC.Length);

                            kernel.SetMemoryArgument(0, image2D);
                            kernel.SetMemoryArgument(1, dimsBuffer);
                            kernel.SetValueArgument(2, wordsPerRow);
                            kernel.SetValueArgument(3, wordsPerPlane);
                            kernel.SetValueArgument(4, slicenr);
                            kernel.SetValueArgument(5, lowerThreshold);
                            kernel.SetValueArgument(6, upperThreshold);
                            kernel.SetMemoryArgument(7, c);

                            // Execute the kernel "count" times. After this call returns, "eventList" will contain an event associated with this command.
                            // If eventList == null or typeof(eventList) == ReadOnlyCollection<ComputeEventBase>, a new event will not be created.
                            commands.Execute(kernel, null, new long[] { dims[0], dims[1] }, null, eventList);

                            // Read back the results. If the command-queue has out-of-order execution enabled (default is off), ReadFromBuffer 
                            // will not execute until any previous events in eventList (in our case only eventList[0]) are marked as complete 
                            // by OpenCL. By default the command-queue will execute the commands in the same order as they are issued from the host.
                            // eventList will contain two events after this method returns.
                            commands.ReadFromBuffer(c, ref arrC, false, eventList);

                            // A blocking "ReadFromBuffer" (if 3rd argument is true) will wait for itself and any previous commands
                            // in the command queue or eventList to finish execution. Otherwise an explicit wait for all the opencl commands 
                            // to finish has to be issued before "arrC" can be used. 
                            // This explicit synchronization can be achieved in two ways:
                            // 1) Wait for the events in the list to finish,
                            //eventList.Wait();
                            //}
                            // 2) Or simply use
                            commands.Finish();

                            slicenr++;
                        }

                    }
                }
            }

And my kernel code:

const sampler_t smp = CLK_FILTER_NEAREST | CLK_ADDRESS_CLAMP |   CLK_NORMALIZED_COORDS_FALSE;
kernel void LowerThreshold(
    read_only image2d_t image,
    global int* brickSize,
    uint wordsPerRow,
    uint wordsPerPlane,
    int slicenr,
    int lower,
    int upper,
    global write_only uint* c )
{

    int4 coord = (int4)(get_global_id(0),get_global_id(1),slicenr,1);
    uint4 pixel = read_imageui(image, smp, coord.xy);

    uchar BinaryWordShift = 5;
    int BinaryWordBits = 32;
    if (
            (0 <= coord.x) &&
            (0 <= coord.y) &&
            (0 <= coord.z) &&
            (coord.x < brickSize[0]) &&
            (coord.y < brickSize[1]) &&
            (coord.z < brickSize[2])
    ) {
        global uint* word =
            c + 1 + (coord.z * wordsPerPlane) +
            (coord.y * wordsPerRow) +
            (coord.x >> BinaryWordShift);

        uint mask = (uint)(0x1 << ((BinaryWordBits - 1) - (uchar)(coord.x & 0x1f)));
        //if (pixel.w > lower && pixel.w < upper) {
        if (slicenr < 15) {
            *word |= mask;
        } else {
            *word &= ~mask;
        }
    }
}

score 1 · Accepted Answer

两个问题：

您已将“c”声明为“write_only”，但使用“|=”和“&=”运算符，它们是读-修改-写
正如其他海报所提到的，如果两个工作项正在访问同一个单词，则 read-modify-write 之间存在竞争条件，这将导致错误。原子操作比非原子操作慢得多，所以虽然可能，但不推荐。

我建议您将输出放大 8 倍并使用字节而不是位。这将使您的输出只写，并且还将消除争用，从而消除竞争条件。

或者（如果数据紧凑性或格式很重要）每个工作项一次处理 8 个元素，并将复合 8 位输出写入单个字节。这将是只写的，没有争用，并且仍然具有您的数据紧凑性。

c# - Pointers and bit operators in GPU kernels

1 回答 1

Related

Reference