image-processing - Fast pixel count on a binary image - ARM NEON intrinsics - iOS development

Question

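Can someone point me to a fast function for counting the number of white pixels in a binary image? I need it for iOS app development. I am working directly on the image's memory, which is defined as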

  bool *imageData = (bool *) malloc(noOfPixels * sizeof(bool));

I am implementing the function as

             int whiteCount = 0;
             for (int q=i; q<i+windowHeight; q++)
             {
                 for (int w=j; w<j+windowWidth; w++)
                 { 
                     if (imageData[q*W + w] == 1)
                         whiteCount++;
                 }
             }

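This is obviously the slowest function possible. I have heard that ARM NEON intrinsics on iOS can be used to perform several operations in one cycle. Maybe that is the way to go?

The problem is that I am not very familiar with them and do not have enough time to learn assembly language at the moment. So it would be great if someone could post NEON intrinsics code for the problem above, or any other fast implementation in C/C++.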

The only NEON intrinsics code I have been able to find online is this RGB-to-gray conversion: http://computer-vision-talks.com/2011/02/a-very-fast-bgra-to-grayscale-conversion-手机/

score 3 · Accepted Answer

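First, you can speed up the original code a little by factoring the multiplication out of the inner loop and getting rid of the branch: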

 int whiteCount = 0;
 for (int q = i; q < i + windowHeight; q++)
 {
     const bool * const row = &imageData[q * W];

     for (int w = j; w < j + windowWidth; w++)
     { 
         whiteCount += row[w];
     }
 }

(This assumes that imageData[] is truly binary, i.e. each element can only be 0 or 1.)
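For reference, here is the same loop wrapped up as a self-contained function, as a minimal sketch. The name count_white_scalar and its parameters are made up for illustration (stride plays the role of W above), and it relies on the same 0-or-1 assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (name and signature are not from the question):
 * counts the true pixels in a window of a row-major bool image.
 * stride is the full image width in pixels (W in the code above). */
static size_t count_white_scalar(const bool *image, size_t stride,
                                 size_t top, size_t left,
                                 size_t height, size_t width)
{
    assert(image != NULL);
    assert(left + width <= stride);

    size_t count = 0;
    for (size_t y = top; y < top + height; y++) {
        const bool *row = &image[y * stride];   /* one multiply per row */
        for (size_t x = left; x < left + width; x++) {
            count += row[x];                    /* bool converts to 0 or 1, no branch */
        }
    }
    return count;
}
```

Calling count_white_scalar(imageData, W, i, j, windowHeight, windowWidth) should correspond to the loop above.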

Here is a simple NEON implementation:

#include <arm_neon.h>

// ...

int q, w;                                           // i and j are the window origin
int whiteCount = 0;
uint32x4_t v_count = { 0 };

for (q = i; q < i + windowHeight; q++)
{
    const bool * const row = &imageData[q * W];

    uint16x8_t vrow_count = { 0 };

    for (w = j; w <= j + windowWidth - 16; w += 16) // SIMD loop
    {
        uint8x16_t v = vld1q_u8((const uint8_t *)&row[w]); // load 16 x 8 bit pixels
        vrow_count = vpadalq_u8(vrow_count, v);     // accumulate 16 bit row counts
    }
    for ( ; w < j + windowWidth; ++w)               // scalar clean up loop
    {
        whiteCount += row[w];
    }
    v_count = vpadalq_u16(v_count, vrow_count);     // update 32 bit image counts
}                                                   // from 16 bit row counts
// add 4 x 32 bit partial counts from SIMD loop to scalar total
whiteCount += vgetq_lane_u32(v_count, 0);
whiteCount += vgetq_lane_u32(v_count, 1);
whiteCount += vgetq_lane_u32(v_count, 2);
whiteCount += vgetq_lane_u32(v_count, 3);
// total is now in whiteCount

(This assumes that imageData[] is truly binary, that imageWidth <= 2^19, and that sizeof(bool) == 1.)
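If it helps, the fragment above could be packaged roughly like this. This is only a sketch under the same assumptions: count_white_window and its parameters are hypothetical names, the image is taken as uint8_t bytes that are all 0 or 1, and the NEON path is guarded so the function falls back to the plain scalar loop when NEON is not available. It asserts the slightly stricter width < 2^19, because each vpadalq_u8 step can add up to 2 to a 16-bit lane, so 2^15 all-white iterations would reach 65536.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: counts the bytes equal to 1 in a window of a
 * row-major image whose bytes are all 0 or 1. */
static uint32_t count_white_window(const uint8_t *image, size_t stride,
                                   size_t top, size_t left,
                                   size_t height, size_t width)
{
    assert(image != NULL);
    assert(left + width <= stride);
    assert(width < ((size_t)1 << 19));   /* keeps the 16-bit row lanes from overflowing */

    uint32_t count = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint32x4_t v_count = vdupq_n_u32(0);
#endif
    for (size_t y = top; y < top + height; y++) {
        const uint8_t *row = &image[y * stride + left];
        size_t x = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
        uint16x8_t v_row = vdupq_n_u16(0);
        for (; x + 16 <= width; x += 16) {
            uint8x16_t v = vld1q_u8(&row[x]);    /* load 16 pixels */
            v_row = vpadalq_u8(v_row, v);        /* pairwise add into 8 x 16-bit lanes */
        }
        v_count = vpadalq_u16(v_count, v_row);   /* fold row counts into 4 x 32-bit lanes */
#endif
        for (; x < width; x++) {                 /* tail, or the whole row without NEON */
            count += row[x];
        }
    }
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    count += vgetq_lane_u32(v_count, 0);
    count += vgetq_lane_u32(v_count, 1);
    count += vgetq_lane_u32(v_count, 2);
    count += vgetq_lane_u32(v_count, 3);
#endif
    return count;
}
```

Comparing its result against the plain scalar loop on a few windows whose width is not a multiple of 16 is a cheap way to check both the SIMD loop and the clean-up loop.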

Updated version for unsigned char, with the value 255 for white and 0 for black:

#include <arm_neon.h>

// ...

int q, w;                                           // i and j are the window origin
int whiteCount = 0;
const uint8x16_t v_mask = { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 };
uint32x4_t v_count = { 0 };

for (q = i; q < i + windowHeight; q++)
{
    const uint8_t * const row = &imageData[q * W];

    uint16x8_t vrow_count = { 0 };

    for (w = j; w <= j + windowWidth - 16; w += 16) // SIMD loop
    {
        uint8x16_t v = vld1q_u8(&row[w]);           // load 16 x 8 bit pixels
        v = vandq_u8(v, v_mask);                    // mask out all but LS bit
        vrow_count = vpadalq_u8(vrow_count, v);     // accumulate 16 bit row counts
    }
    for ( ; w < j + windowWidth; ++w)               // scalar clean up loop
    {
        whiteCount += (row[w] == 255);
    }
    v_count = vpadalq_u16(v_count, vrow_count);     // update 32 bit image counts
}                                                   // from 16 bit row counts
// add 4 x 32 bit partial counts from SIMD loop to scalar total
whiteCount += vgetq_lane_u32(v_count, 0);
whiteCount += vgetq_lane_u32(v_count, 1);
whiteCount += vgetq_lane_u32(v_count, 2);
whiteCount += vgetq_lane_u32(v_count, 3);
// total is now in whiteCount

(This assumes that imageData[] holds 255 for white and 0 for black, and that imageWidth <= 2^19.)
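One thing to be aware of: the SIMD loop masks the least significant bit, so it would count any odd value, while the scalar clean-up loop counts only exact 255. If the data might not be strictly 0/255, a compare could be swapped in for the mask. A rough sketch for a single row follows; count_255_in_row is a hypothetical name, the NEON path is guarded with a scalar fallback, and this is a variation on the code above rather than part of it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: counts the bytes equal to 255 in one row. */
static uint32_t count_255_in_row(const uint8_t *row, size_t width)
{
    assert(row != NULL);
    assert(width < ((size_t)1 << 19));   /* keeps the 16-bit lanes from overflowing */

    uint32_t count = 0;
    size_t x = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    const uint8x16_t v_white = vdupq_n_u8(255);
    uint16x8_t v_row = vdupq_n_u16(0);
    for (; x + 16 <= width; x += 16) {
        uint8x16_t v   = vld1q_u8(&row[x]);      /* load 16 pixels */
        uint8x16_t eq  = vceqq_u8(v, v_white);   /* 0xFF where pixel == 255, else 0x00 */
        uint8x16_t one = vshrq_n_u8(eq, 7);      /* 0xFF -> 1, 0x00 -> 0 */
        v_row = vpadalq_u8(v_row, one);          /* accumulate 16-bit counts */
    }
    uint32x4_t v_sum = vpaddlq_u16(v_row);       /* widen 8 x 16 bit to 4 x 32 bit */
    count += vgetq_lane_u32(v_sum, 0);
    count += vgetq_lane_u32(v_sum, 1);
    count += vgetq_lane_u32(v_sum, 2);
    count += vgetq_lane_u32(v_sum, 3);
#endif
    for (; x < width; x++) {                     /* tail, or the whole row without NEON */
        count += (row[x] == 255);
    }
    return count;
}
```

With this variant the SIMD loop and the clean-up loop apply the same test, so a stray value such as 1 or 254 is ignored by both.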

Note that all of the code above is untested and may need some further work.

score 0 · Accepted Answer

http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html

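Section 6.55.3.6

The vectorized operations will do the comparisons for you and put the results into a structure, but you would still need to go through each element of the structure and determine whether it is zero.

How fast does the loop currently run, and how fast do you need it to run? Also remember that NEON works in the same registers as the floating-point unit, so using NEON here may force an FPU context switch.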
