c - Fast transposition of an image and Sobel filter optimization in C (SIMD)

Question

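I want to implement a really (really) fast Sobel operator for a ray tracer that a friend and I wrote (the source code can be found here). Here is what I have figured out so far...

First, assume the image is a grayscale picture stored row by row in an array of 8-bit unsigned integers.

To write a real Sobel filter, I need to compute Gx and Gy for each pixel. Each of these numbers is computed from the 6 pixels next to the origin. But SIMD instructions let me process 16 or even 32 (AVX) pixels at once. Fortunately, the operator's kernel has some nice properties, so I can compute Gy by:

Subtracting rows i and i+2 and storing the result in row i+1 of some other picture (array)
Adding column i, two times column i+1, and column i+2 to get column i+1 of the final picture

I would do the same thing (but transposed) to compute Gx, then add the two pictures.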
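The two steps listed above for Gy can be sketched in scalar C as follows. This is a minimal sketch that assumes a row-major 8-bit image and a 16-bit intermediate buffer. The function names and the int16_t storage are illustrative choices, not part of the original post. The sign convention (upper row minus lower row) is also an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Step 1: tmp row y = src row (y-1) - src row (y+1), element by element.
   Every output depends only on pixels in the same column, so the inner
   loop walks contiguous memory. */
void sobel_gy_vertical(const uint8_t *src, int16_t *tmp, int width, int height)
{
    assert(width >= 3 && height >= 3);
    for (int y = 1; y < height - 1; ++y) {
        const uint8_t *above = src + (size_t)(y - 1) * width;
        const uint8_t *below = src + (size_t)(y + 1) * width;
        int16_t *out = tmp + (size_t)y * width;
        for (int x = 0; x < width; ++x)
            out[x] = (int16_t)(above[x] - below[x]);
    }
}

/* Step 2: gy[x] = tmp[x-1] + 2*tmp[x] + tmp[x+1] along each interior row. */
void sobel_gy_horizontal(const int16_t *tmp, int16_t *gy, int width, int height)
{
    assert(width >= 3 && height >= 3);
    for (int y = 1; y < height - 1; ++y) {
        const int16_t *in = tmp + (size_t)y * width;
        int16_t *out = gy + (size_t)y * width;
        for (int x = 1; x < width - 1; ++x)
            out[x] = (int16_t)(in[x - 1] + 2 * in[x] + in[x + 1]);
    }
}
```

Calling sobel_gy_vertical() and then sobel_gy_horizontal() on the same buffers gives the same interior values as applying the 3x3 Gy kernel directly. Both loops read along rows, so in this sketch neither step needs the image to be transposed.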

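Some notes:

I don't care about memory allocation, since everything will be allocated at the beginning.
I can deal with the overflow and sign issues by dividing the values by four (thanks to _mm_srli_epi8):
    (uint8_t >> 2 - uint8_t >> 2) = int7_t //really stored as int8_t
    int7_t + uint8_t << 1 >> 2 + int7_t = uint8_t //some precision is lost but I don't care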
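As a hedged illustration of the divide-by-four step: SSE2 has no per-byte shift instruction, so the sketch below emulates one with a 16-bit shift followed by a mask. The helper names srli2_epu8 and quarter_diff_rows are made up for this example, and the requirement that n be a multiple of 16 is an assumption to keep it short.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>
#include <stdint.h>

/* Logical right shift by 2 of sixteen unsigned bytes. The 16-bit shift lets
   two bits of each high byte leak into the low byte next to it, so the mask
   0x3F clears the top two bits of every byte afterwards. */
static inline __m128i srli2_epu8(__m128i v)
{
    return _mm_and_si128(_mm_srli_epi16(v, 2), _mm_set1_epi8(0x3F));
}

/* out[i] = (a[i] >> 2) - (b[i] >> 2) for n pixels, n a multiple of 16.
   Each result lies in [-63, 63], so it fits in a signed byte. */
void quarter_diff_rows(const uint8_t *a, const uint8_t *b, int8_t *out, size_t n)
{
    assert(n % 16 == 0);
    for (size_t i = 0; i < n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        __m128i diff = _mm_sub_epi8(srli2_epu8(va), srli2_epu8(vb));
        _mm_storeu_si128((__m128i *)(out + i), diff);
    }
}
```

Passing rows i and i+2 as a and b would give the first step of Gy at reduced precision, which matches the "int7_t stored as int8_t" idea in the note above.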

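The real problem I am facing is going from rows to columns, since otherwise I cannot load the picture into SIMD registers. I must flip the image at least three times, mustn't I?

Once for the original picture. Then I can compute the first step for Gx and Gy, and then flip the resulting pictures to compute the second step.

So, here are my questions:

Is this kind of implementation a good idea?
Is there a way to transpose an array faster than the dumb algorithm? (I don't think so)
Where will the bottlenecks be? (Any guess? :P)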

score 8 · Accepted Answer

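I think transpose/2-pass is not a good way to optimize Sobel operator code. The Sobel operator is not computationally heavy, so spending extra memory accesses on a transpose or a second pass does not pay off here. I wrote some Sobel operator test code to see how fast SSE can get. This code does not handle the first and last edge pixels, and it uses the FPU to calculate the sqrt() value.

The Sobel operator needs 14 multiplications, 1 square root, 11 additions, 2 min/max operations, 12 reads and 1 write. This means you can process one component in 20~30 cycles if you optimize the code well.

The FPUSobel() function took 2113044 CPU cycles to process a 256 * 256 image, i.e. 32.76 cycles/component. I will convert this sample code to SSE.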

void FPUSobel()
{
    BYTE* image_0 = g_image + g_image_width * 0;
    BYTE* image_1 = g_image + g_image_width * 1;
    BYTE* image_2 = g_image + g_image_width * 2;
    DWORD* screen = g_screen + g_screen_width*1;

    for(int y=1; y<g_image_height-1; ++y)
    {
        for(int x=1; x<g_image_width-1; ++x)
        {
            float gx =  image_0[x-1] * (+1.0f) + 
                        image_0[x+1] * (-1.0f) +
                        image_1[x-1] * (+2.0f) + 
                        image_1[x+1] * (-2.0f) +
                        image_2[x-1] * (+1.0f) + 
                        image_2[x+1] * (-1.0f);

            float gy =  image_0[x-1] * (+1.0f) + 
                        image_0[x+0] * (+2.0f) + 
                        image_0[x+1] * (+1.0f) +
                        image_2[x-1] * (-1.0f) + 
                        image_2[x+0] * (-2.0f) + 
                        image_2[x+1] * (-1.0f);


            int result = (int)min(255.0f, max(0.0f, sqrtf(gx * gx + gy * gy)));

            screen[x] = 0x01010101 * result;
        }
        image_0 += g_image_width;
        image_1 += g_image_width;
        image_2 += g_image_width;
        screen += g_screen_width;
    }
}

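The SseSobel() function took 613220 CPU cycles to process the same 256*256 image. That is 9.51 cycles/component, 3.4 times faster than FPUSobel(). There is still some room for optimization, but it will not get more than 4 times faster, because it uses 4-way SIMD.

This function uses the SoA approach to process 4 pixels at once. SoA is better than AoS for most array or image data, because with AoS you have to transpose/shuffle before you can use it. SoA also makes it far easier to convert ordinary C code into SSE code.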

void SseSobel()
{
    BYTE* image_0 = g_image + g_image_width * 0;
    BYTE* image_1 = g_image + g_image_width * 1;
    BYTE* image_2 = g_image + g_image_width * 2;
    DWORD* screen = g_screen + g_screen_width*1;

    __m128 const_p_one = _mm_set1_ps(+1.0f);
    __m128 const_p_two = _mm_set1_ps(+2.0f);
    __m128 const_n_one = _mm_set1_ps(-1.0f);
    __m128 const_n_two = _mm_set1_ps(-2.0f);

    for(int y=1; y<g_image_height-1; ++y)
    {
        for(int x=1; x<g_image_width-1; x+=4)
        {
            // load 16 bytes per row; only the low 8 are unpacked, and components 0~5 of those are used
            __m128i current_0 = _mm_unpacklo_epi8(_mm_loadu_si128((__m128i*)(image_0+x-1)), _mm_setzero_si128());
            __m128i current_1 = _mm_unpacklo_epi8(_mm_loadu_si128((__m128i*)(image_1+x-1)), _mm_setzero_si128());
            __m128i current_2 = _mm_unpacklo_epi8(_mm_loadu_si128((__m128i*)(image_2+x-1)), _mm_setzero_si128());

            // image_00 = { image_0[x-1], image_0[x+0], image_0[x+1], image_0[x+2] }
            __m128 image_00 = _mm_cvtepi32_ps(_mm_unpacklo_epi16(current_0, _mm_setzero_si128()));
            // image_01 = { image_0[x+0], image_0[x+1], image_0[x+2], image_0[x+3] }
            __m128 image_01 = _mm_cvtepi32_ps(_mm_unpacklo_epi16(_mm_srli_si128(current_0, 2), _mm_setzero_si128()));
            // image_02 = { image_0[x+1], image_0[x+2], image_0[x+3], image_0[x+4] }
            __m128 image_02 = _mm_cvtepi32_ps(_mm_unpacklo_epi16(_mm_srli_si128(current_0, 4), _mm_setzero_si128()));
            __m128 image_10 = _mm_cvtepi32_ps(_mm_unpacklo_epi16(current_1, _mm_setzero_si128()));
            __m128 image_12 = _mm_cvtepi32_ps(_mm_unpacklo_epi16(_mm_srli_si128(current_1, 4), _mm_setzero_si128()));
            __m128 image_20 = _mm_cvtepi32_ps(_mm_unpacklo_epi16(current_2, _mm_setzero_si128()));
            __m128 image_21 = _mm_cvtepi32_ps(_mm_unpacklo_epi16(_mm_srli_si128(current_2, 2), _mm_setzero_si128()));
            __m128 image_22 = _mm_cvtepi32_ps(_mm_unpacklo_epi16(_mm_srli_si128(current_2, 4), _mm_setzero_si128()));

            __m128 gx = _mm_add_ps( _mm_mul_ps(image_00,const_p_one),
                        _mm_add_ps( _mm_mul_ps(image_02,const_n_one),
                        _mm_add_ps( _mm_mul_ps(image_10,const_p_two),
                        _mm_add_ps( _mm_mul_ps(image_12,const_n_two),
                        _mm_add_ps( _mm_mul_ps(image_20,const_p_one),
                                    _mm_mul_ps(image_22,const_n_one))))));

            __m128 gy = _mm_add_ps( _mm_mul_ps(image_00,const_p_one), 
                        _mm_add_ps( _mm_mul_ps(image_01,const_p_two), 
                        _mm_add_ps( _mm_mul_ps(image_02,const_p_one),
                        _mm_add_ps( _mm_mul_ps(image_20,const_n_one), 
                        _mm_add_ps( _mm_mul_ps(image_21,const_n_two), 
                                    _mm_mul_ps(image_22,const_n_one))))));

            __m128 result = _mm_min_ps( _mm_set1_ps(255.0f), 
                            _mm_max_ps( _mm_set1_ps(0.0f), 
                                        _mm_sqrt_ps(_mm_add_ps(_mm_mul_ps(gx, gx), _mm_mul_ps(gy,gy))) ));

            __m128i pack_32 = _mm_cvtps_epi32(result); //R32,G32,B32,A32
            __m128i pack_16 = _mm_packs_epi32(pack_32, pack_32); //R16,G16,B16,A16,R16,G16,B16,A16
            __m128i pack_8 = _mm_packus_epi16(pack_16, pack_16); //RGBA,RGBA,RGBA,RGBA
            __m128i unpack_2 = _mm_unpacklo_epi8(pack_8, pack_8); //RRGG,BBAA,RRGG,BBAA
            __m128i unpack_4 = _mm_unpacklo_epi8(unpack_2, unpack_2); //RRRR,GGGG,BBBB,AAAA

            _mm_storeu_si128((__m128i*)(screen+x),unpack_4);
        }
        image_0 += g_image_width;
        image_1 += g_image_width;
        image_2 += g_image_width;
        screen += g_screen_width;
    }
}
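One thing to watch in the loop above: each row is loaded 16 bytes at a time even though only the first few bytes are used, and the last iteration stores four pixels even when fewer than four interior pixels remain. Unless the buffers are padded, the final iterations can therefore touch memory outside the intended row. A possible workaround, sketched here with hypothetical helper names (load4_u8_as_ps and vector_loop_end are not part of the original answer), is to read exactly four bytes per vector and to stop the vector loop early, leaving the tail to scalar code.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>
#include <string.h>

/* Widen exactly four consecutive 8-bit pixels starting at p to four floats.
   Only 4 bytes are read, so the load cannot run past p + 3. */
__m128 load4_u8_as_ps(const uint8_t *p)
{
    int32_t bits;
    memcpy(&bits, p, sizeof bits);           /* alignment-safe 4-byte read */
    __m128i zero = _mm_setzero_si128();
    __m128i v = _mm_cvtsi32_si128(bits);     /* four bytes in the low 32 bits */
    v = _mm_unpacklo_epi8(v, zero);          /* 8-bit  -> 16-bit lanes */
    v = _mm_unpacklo_epi16(v, zero);         /* 16-bit -> 32-bit lanes */
    return _mm_cvtepi32_ps(v);               /* 32-bit integers -> floats */
}

/* Exclusive upper bound for a loop "for (x = 1; x < end; x += 4)" such that
   every 4-pixel group stays inside the interior columns 1 .. width-2.
   Columns from 'end' up to width-2 are left for a scalar tail loop. */
int vector_loop_end(int width)
{
    assert(width >= 3);
    return 1 + ((width - 2) / 4) * 4;
}
```

With this sketch, image_00 would become load4_u8_as_ps(image_0 + x - 1), image_01 would use image_0 + x, and image_02 would use image_0 + x + 1. For a 256-pixel row the vector loop then ends at x = 253, and the remaining two interior pixels could be handled with the scalar formula from FPUSobel().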

score 2 · Accepted Answer

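For the code in @zupet's answer:
Instead of multiplying by one (const_p_one), I would do... nothing. Compilers might not optimize that away.
Instead of multiplying by two, I would add the value to itself; with integer arithmetic that is faster than a mul. With FP, it mostly just avoids needing another vector constant. Haswell has worse FP add throughput than FP mul, but Skylake and Zen are balanced.

Instead of multiplying by -1.0, negate with _mm_xor_ps and -0.0 to flip the sign bit.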
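A minimal sketch of that sign-flip trick, assuming plain SSE and an array whose length is a multiple of 4 (the function name negate_ps is made up for this example):

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE */

/* Negate n floats in place, n a multiple of 4, without any multiplication.
   -0.0f has only the sign bit set, so XOR-ing with it flips the sign of
   each lane and leaves exponent and mantissa untouched. */
void negate_ps(float *data, size_t n)
{
    assert(n % 4 == 0);
    const __m128 sign_mask = _mm_set1_ps(-0.0f);
    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(data + i);
        _mm_storeu_ps(data + i, _mm_xor_ps(v, sign_mask));
    }
}
```

In SseSobel() the same XOR could stand in for each _mm_mul_ps by const_n_one, although folding the negative terms into a single subtraction avoids the negation altogether.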

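I would compute the positive and negative terms independently, side by side, rather than one after the other (for better pipelining), using the same arithmetic and a single sub at the end. And so on... there are still a lot of improvements pending.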
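Put together, the suggestions so far might look like the sketch below. It assumes the eight neighbour vectors have already been widened to floats, as in SseSobel(). The function name and the parameter names (iRC for row R, column offset C) are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE */

/* Sobel gx and gy for four pixels using only additions and one subtraction
   per gradient. Multiplying by 1 is dropped, multiplying by 2 becomes v + v,
   and the negative terms are summed separately and subtracted once. */
void sobel_gx_gy_ps(__m128 i00, __m128 i01, __m128 i02,
                    __m128 i10,             __m128 i12,
                    __m128 i20, __m128 i21, __m128 i22,
                    __m128 *gx, __m128 *gy)
{
    assert(gx != NULL && gy != NULL);

    /* gx = (i00 + 2*i10 + i20) - (i02 + 2*i12 + i22) */
    __m128 gx_pos = _mm_add_ps(_mm_add_ps(i00, i20), _mm_add_ps(i10, i10));
    __m128 gx_neg = _mm_add_ps(_mm_add_ps(i02, i22), _mm_add_ps(i12, i12));

    /* gy = (i00 + 2*i01 + i02) - (i20 + 2*i21 + i22) */
    __m128 gy_pos = _mm_add_ps(_mm_add_ps(i00, i02), _mm_add_ps(i01, i01));
    __m128 gy_neg = _mm_add_ps(_mm_add_ps(i20, i22), _mm_add_ps(i21, i21));

    /* The four partial sums do not depend on each other, so the CPU is free
       to overlap them before the two final subtractions. */
    *gx = _mm_sub_ps(gx_pos, gx_neg);
    *gy = _mm_sub_ps(gy_pos, gy_neg);
}
```

The signs follow the kernels used in FPUSobel(). Whether this is measurably faster than the multiply-based version depends on the compiler and the CPU, so it would need to be benchmarked.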

With AVX+FMA, fused multiply-add intrinsics such as _mm_fmadd_ps can be much faster.
