c++ - 为什么我的手动调优、支持 SSE 的代码这么慢？

Question

长话短说：我正在用 C++ 开发一个计算密集型图像处理应用程序。它需要在从较大图像中提取的小像素块上计算图像扭曲的许多变体。该程序的运行速度不如我想的那么快。分析 (OProfile) 显示扭曲/插值函数消耗超过 70% 的 CPU 时间，因此尝试优化它似乎很明显。

到目前为止，我一直在使用 OpenCV 图像处理库来完成这项任务：

// some parameters for the image warps (position, stretch, skew)
struct WarpParams;

void Image::get(const WarpParams &params)
{
    // fills matrices mapX_ and mapY_ with x and y coordinates of points to be
    // inteprolated.
    updateCoordMaps(params); 
    // perform interpolation to obtain pixels at point locations 
    // P(mapX_[i], mapY_[i]) based on the original data and put the 
    // result in pixels_. Use bicubic inteprolation.
    cv::remap(image_->data(), pixels_, mapX_, mapY_, CV_INTER_CUBIC);
}

我编写了自己的插值函数并将其放入测试工具中，以确保在我进行实验时的正确性，并将其与旧函数进行基准测试。

我的功能运行非常缓慢，这是意料之中的。一般来说，这个想法是：

遍历mapX_、mapY_坐标图，提取下一个待插值像素的（实值）坐标；
从插值像素周围的原始图像中检索一个 4x4 的像素值（整数坐标）邻域；
计算这16个像素中每个像素的卷积核系数；
将插值像素的值计算为 16 个像素值和内核系数的线性组合。

在我的 Wolfdale Core2 Duo 上，旧功能的计时时间为 25us。新的花了 587us (!)。我急切地戴上巫师帽，开始破解代码。我设法删除了所有分支，省略了一些重复计算，并将 3 个嵌套循环转换为坐标图上的一个。这就是我想出的：

void Image::getConvolve(const WarpParams &params)
{
    __declspec(align(16)) static float kernelX[4], kernelY[4];

    // grab pointers to coordinate map matrices and the original image
    const float 
        *const mapX = mapX_.ptr<float>(),
        *const mapY = mapY_.ptr<float>(),
        *const img  = image_->data().ptr<float>();
    // grab pointer to the output image
    float *const subset = pixels_.ptr<float>(),
        x, y, xint, yint;

    const ptrdiff_t imgw = image_->width();
    ptrdiff_t imgoffs; 

    __m128 v_px, v_kernX, v_kernY, v_val;

    // iterate over the coordinate matrices as linear buffers
    for (size_t idx = 0; idx < pxCount; ++idx)
    {
        // retrieve coordinates of next pixel from precalculated maps,
        // break up each into fractional and integer part
        x = modf(mapX[idx], &xint);
        y = modf(mapY[idx], &yint);
        // obtain offset of the top left pixel from the required 4x4 
        // neighborhood of the current pixel in the image's 
        // buffer (sadly, the position will be unaligned)
        imgoffs = (((ptrdiff_t)yint - 1) * imgw) + (ptrdiff_t)xint - 1;

        // calculate all 4 convolution kernel values for every row and 
        // every column
        tap4Kernel(x, kernelX);
        tap4Kernel(y, kernelY);

        // load the kernel values for the columns, these don't change
        v_kernX = _mm_load_ps(kernelX);

        // process a row of the 4x4 neighborhood
        // get set of convolution kernel values for the current row
        v_kernY = _mm_set_ps1(kernelY[0]);     
        v_px    = _mm_loadu_ps(img + imgoffs); // load the pixel values
        // calculate the linear combination of the pixels with kernelX
        v_px    = _mm_mul_ps(v_px, v_kernX);   
        v_px    = _mm_mul_ps(v_px, v_kernY);   // and kernel Y
        v_val   = v_px;                        // add result to the final value
        imgoffs += imgw;
        // offset points now to next row of the 4x4 neighborhood

        v_kernY = _mm_set_ps1(kernelY[1]);
        v_px    = _mm_loadu_ps(img + imgoffs);
        v_px    = _mm_mul_ps(v_px, v_kernX);
        v_px    = _mm_mul_ps(v_px, v_kernY);
        v_val   = _mm_add_ps(v_val, v_px);
        imgoffs += imgw;

        /*... same for kernelY[2] and kernelY[3]... */

        // store resulting interpolated pixel value in the subset's
        // pixel matrix
        subset[idx] = horizSum(v_val);
    }
}

// Calculate all 4 values of the 4-tap convolution kernel for 4 neighbors
// of a pixel and store them in an array. Ugly but fast.
// The "arg" parameter is the fractional part of a pixel's coordinate, i.e.
// a number in the range <0,1)
void Image::tap4Kernel(const float arg, float *out)
{
    // chaining intrinsics was slower, so this is done in separate steps
    // load the argument into 4 cells of a XMM register
    __m128
        v_arg   = _mm_set_ps1(arg),
        v_coeff = _mm_set_ps(2.0f, 1.0f, 0.0f, -1.0f);

    // subtract vector of [-1, 0, 1, 2] to obtain coorinates of 4 neighbors
    // for kernel calculation
    v_arg = _mm_sub_ps(v_arg, v_coeff);
    // clear sign bits, this is equivalent to fabs() on all 4
    v_coeff = _mm_set_ps1(-0.f);
    v_arg = _mm_andnot_ps(v_coeff, v_arg); 

    // calculate values of abs(argument)^3 and ^2
    __m128
        v_arg2 = _mm_mul_ps(v_arg, v_arg),
        v_arg3 = _mm_mul_ps(v_arg2, v_arg),
        v_val, v_temp;

    // calculate the 4 kernel values as 
    // arg^3 * A + arg^2 * B + arg * C + D, using 
    // (A,B,C,D) = (-0.5, 2.5, -4, 2) for the outside pixels and 
    // (1.5, -2.5, 0, 1) for inside
    v_coeff = _mm_set_ps(-0.5f, 1.5f, 1.5f, -0.5f);
    v_val   = _mm_mul_ps(v_coeff, v_arg3);
    v_coeff = _mm_set_ps(2.5f, -2.5f, -2.5f, 2.5f);
    v_temp  = _mm_mul_ps(v_coeff, v_arg2);
    v_val   = _mm_add_ps(v_val, v_temp);
    v_coeff = _mm_set_ps(-4.0f, 0.0f, 0.0f, -4.0f),
    v_temp  = _mm_mul_ps(v_coeff, v_arg);
    v_val   = _mm_add_ps(v_val, v_temp);
    v_coeff = _mm_set_ps(2.0f, 1.0f, 1.0f, 2.0f);
    v_val   = _mm_add_ps(v_val, v_coeff);

    _mm_store_ps(out, v_val);
}

我很高兴能够将运行时间控制在 40us 以下，甚至在将 SSE 引入主循环之前，我最后保存了它。我原本预计至少有 3 倍的加速，但它在 36us 时只快了一点点，比我试图改进的旧 get() 慢。更糟糕的是，当我更改基准循环以进行更多运行时，旧函数具有相同的平均运行时间，而我的延长到超过 127us，这意味着一些极端的扭曲参数值需要更长的时间（这是有道理的，因为更多扭曲意味着我需要从原始图像中获取广泛分散的像素值来计算结果）。

我认为原因一定是未对齐的负载，但这无济于事（我需要达到不可预测的像素值）。我在优化部门看不到更多可以做的事情，所以我决定查看 cv::remap() 函数，看看他们是如何做到的。想象一下，我惊讶地发现它包含一堆嵌套循环和大量分支。他们还做了很多我不需要费心的论证验证。据我所知（代码中没有注释），SSE（也有未对齐的负载！）仅用于从坐标图中提取值并将它们四舍五入为整数，然后调用一个函数来执行实际的插值使用常规浮点算术。

我的问题是，为什么我失败得如此悲惨（为什么我的代码这么慢，为什么他们的代码更快，即使它看起来很乱），我能做些什么来改进我的代码？

我没有在此处粘贴 OpenCV 代码，因为这已经太长了，您可以在pastebin中查看。

我在 VC++2010 下以 Release 模式测试和编译了我的代码。使用的 OpenCV 是 v2.3.1 的预编译二进制包。

编辑：像素值是 0..1 范围内的浮点数。分析显示 tap4Kernel() 函数不相关，大部分时间都花在 getConvolve() 中。

EDIT2：我将生成的代码的反汇编粘贴到pastebin。这是在旧的 Banias Celeron 处理器（具有 SSE2）上编译的，但看起来或多或少相同。

EDIT3：在阅读了每个程序员应该知道的关于内存的知识后，我意识到我错误地假设 OpenCV 函数实现的算法与我所做的或多或少相同，但事实并非如此。对于我插值的每个像素，我需要检索其 4x4 邻域，其像素不按顺序放置在图像缓冲区内。我在滥用 CPU 缓存，而 OpenCV 可能不会。VTune 分析似乎同意，因为我的函数有 5,800,000 次内存访问，而 OpenCV 只有 400,000 次。它们的功能一团糟，可能会进一步优化，但它仍然比我有优势，这可能是由于一些更智能的内存和缓存使用方法。

更新：我设法改进了将像素值加载到 XMM 寄存器中的方式。我在对象中分配了一个缓冲区，该缓冲区为图像的每个像素保存 16 个元素的单元格。在图像加载时，我为每个像素使用预先安排的 4x4 邻域序列填充此单元缓冲区。不是很节省空间（图像占用 16 倍的空间），但是这样，负载总是对齐的（不再是 _mm_loadu_ps()），并且我避免了从图像缓冲区中分散读取像素，因为所需的像素顺序存储。令我惊讶的是，几乎没有任何改善。我听说未对齐的负载可能会慢 10 倍，但显然这不是问题所在。但是通过注释掉部分代码，我发现 modf() 调用负责 75% 的运行时间！我将专注于消除这些并发布答案。

score 7 · Accepted Answer

首先是一些观察。

您使用函数静态变量，这可能会导致同步（我不认为它在这里）
该程序集混合了 x87 和 sse 代码。
tap4kernel 是内联的，这很好，但配置文件可能不准确。
modf 没有内联。
该程序集使用 _ftol2_sse（_ftol2_sse，有更快的选项吗？）。
组件移动了很多寄存器。

尝试执行以下操作：

确保您的编译器正在积极优化基于 SSE2 的架构。
使 modf 可用于内联。
避免使用函数静态变量。
如果程序集仍使用 x87 指令，请尝试避免将 float-intimgoffs = (((ptrdiff_t)yint - 1) * imgw) + (ptrdiff_t)xint - 1;强制转换为浮点型变量 __m128。

它可以通过预取地图进一步优化（提前约 4kb 预取）

c++ - 为什么我的手动调优、支持 SSE 的代码这么慢？

1 回答 1

Related

Reference