android - Why is ScriptIntrinsicBlur faster than my method?

Question

I am using Renderscript to apply a Gaussian blur to an image. But no matter what I do, ScriptIntrinsicBlur is much faster. Why is that? Is ScriptIntrinsicBlur using a different method? Here is my RS code:

#pragma version(1)
#pragma rs java_package_name(top.deepcolor.rsimage.utils)

//Gaussian blur algorithm.

//the max radius of gaussian blur
static const int MAX_BLUR_RADIUS = 1024;

//the Gaussian weight of each tap; a radius r needs 2 * r + 1 entries
float blurRatio[(MAX_BLUR_RADIUS << 1) + 1];

//the default blur radius
int blurRadius = 0;

//the width and height of bitmap
uint32_t width;
uint32_t height;

//bind to the input bitmap
rs_allocation input;
//the temp allocation (holds the result of the horizontal pass)
rs_allocation temp;

//set the radius
void setBlurRadius(int radius)
{
    if(1 > radius)
        radius = 1;
    else if(MAX_BLUR_RADIUS < radius)
        radius = MAX_BLUR_RADIUS;

    blurRadius = radius;


    /**
    derive sigma from the blur radius:
    pixels far away from the center contribute almost nothing to it,
    so take sigma = blurRadius / 2.57
    */
    float sigma = 1.0f * blurRadius / 2.57f;
    float deno  = 1.0f / (sigma * sqrt(2.0f * M_PI));
    float nume  = -1.0f / (2.0f * sigma * sigma);

    //calculate the gaussian function
    float sum = 0.0f;
    for(int i = 0, r = -blurRadius; r <= blurRadius; ++i, ++r)
    {
        blurRatio[i] = deno * exp(nume * r * r);
        sum += blurRatio[i];
    }

    //normalization to 1
    int len = radius + radius + 1;
    for(int i = 0; i < len; ++i)
    {
        blurRatio[i] /= sum;
    }

}

/**
the gaussian blur is decomposed into two steps:
1. blur in the horizontal direction
2. blur in the vertical direction
*/
uchar4 RS_KERNEL horizontal(uint32_t x, uint32_t y)
{
    //accumulators must start at zero; uninitialized locals hold garbage
    float a = 0.0f, r = 0.0f, g = 0.0f, b = 0.0f;

    for(int k = -blurRadius; k <= blurRadius; ++k)
    {
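        //do the arithmetic in signed ints so x + k can go negative safely
        int horizontalIndex = (int) x + k;

        //clamp to the image edges
        if(0 > horizontalIndex) horizontalIndex = 0;
        if((int) width <= horizontalIndex) horizontalIndex = (int) width - 1;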

        uchar4 inputPixel = rsGetElementAt_uchar4(input, horizontalIndex, y);

        int blurRatioIndex = k + blurRadius;
        a += inputPixel.a * blurRatio[blurRatioIndex];
        r += inputPixel.r * blurRatio[blurRatioIndex];
        g += inputPixel.g * blurRatio[blurRatioIndex];
        b += inputPixel.b * blurRatio[blurRatioIndex];
    }

    uchar4 out;

    out.a = (uchar) a;
    out.r = (uchar) r;
    out.g = (uchar) g;
    out.b = (uchar) b;

    return out;
}

uchar4 RS_KERNEL vertical(uint32_t x, uint32_t y)
{
    //accumulators must start at zero; uninitialized locals hold garbage
    float a = 0.0f, r = 0.0f, g = 0.0f, b = 0.0f;

    for(int k = -blurRadius; k <= blurRadius; ++k)
    {
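        //do the arithmetic in signed ints so y + k can go negative safely
        int verticalIndex = (int) y + k;

        //clamp to the image edges
        if(0 > verticalIndex) verticalIndex = 0;
        if((int) height <= verticalIndex) verticalIndex = (int) height - 1;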

        uchar4 inputPixel = rsGetElementAt_uchar4(temp, x, verticalIndex);

        int blurRatioIndex = k + blurRadius;
        a += inputPixel.a * blurRatio[blurRatioIndex];
        r += inputPixel.r * blurRatio[blurRatioIndex];
        g += inputPixel.g * blurRatio[blurRatioIndex];
        b += inputPixel.b * blurRatio[blurRatioIndex];
    }

    uchar4 out;

    out.a = (uchar) a;
    out.r = (uchar) r;
    out.g = (uchar) g;
    out.b = (uchar) b;

    return out;
}

score 2 · Accepted Answer

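Renderscript intrinsics are implemented very differently from anything you can achieve with a script of your own. There are several reasons for this, but the main one is that they are built by the developers of each device's RS driver in a way that makes the best use of that particular hardware/SoC configuration, and they most likely make low-level calls to the hardware that are simply not available at the RS programming layer.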

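Android does, however, provide generic implementations of these intrinsics to fall back on in case no lower-level hardware implementation is available. Seeing how these generic ones are written will give you a better idea of how the intrinsics work. For example, you can see the source code of the generic implementation of the 3x3 convolution intrinsic in rsCpuIntrinsicConvolve3x3.cpp.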

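Take a close look at the code starting at line 98 of that source file and notice that it does not use any for loops to do the convolution. This is known as loop unrolling: you explicitly multiply and add the 9 corresponding memory locations in the code, which avoids the for loop construct. This is the first rule you must take into account when optimizing parallel code. You need to get rid of all branching in your kernel. Looking at your code, you have a lot of `if`s and `for`s that cause branching, which means the control flow of the program does not run straight through from beginning to end.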
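As a rough illustration of what unrolling means here, the sketch below computes the same 3x3 convolution twice: once with nested loops and once with the nine multiply-adds written out. It is plain C rather than Renderscript so that it compiles anywhere, it works on a single-channel float image, and the function names are made up for this example. It is not the code from rsCpuIntrinsicConvolve3x3.cpp.

```c
#include <stddef.h>

/* Loop version: 3x3 convolution at an interior pixel (x, y) of a
   single-channel, row-major float image. k holds the 9 weights row by row.
   The caller must ensure 1 <= x < width - 1 and that rows y - 1 .. y + 1 exist. */
float convolve3x3_loop(const float *img, size_t width, size_t x, size_t y,
                       const float k[9])
{
    float sum = 0.0f;
    for (size_t j = 0; j < 3; ++j) {
        for (size_t i = 0; i < 3; ++i) {
            sum += img[(y - 1 + j) * width + (x - 1 + i)] * k[j * 3 + i];
        }
    }
    return sum;
}

/* Unrolled version: the same nine multiply-adds, written out explicitly,
   so the body contains no loop counters and no loop branches. */
float convolve3x3_unrolled(const float *img, size_t width, size_t x, size_t y,
                           const float k[9])
{
    const float *r0 = img + (y - 1) * width + (x - 1); /* row above   */
    const float *r1 = r0 + width;                      /* center row  */
    const float *r2 = r1 + width;                      /* row below   */

    return r0[0] * k[0] + r0[1] * k[1] + r0[2] * k[2]
         + r1[0] * k[3] + r1[1] * k[4] + r1[2] * k[5]
         + r2[0] * k[6] + r2[1] * k[7] + r2[2] * k[8];
}
```

Both functions return the same value for the same input; the difference is only in the control flow the compiler and hardware have to deal with. An optimizing compiler may unroll the loop version on its own when the trip count is a compile-time constant, which is not the case for a radius set at run time as in the question.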

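If you unroll your for loops, you will see a performance gain right away. Note that by removing the for constructs you will no longer be able to generalize your kernel to every possible radius. In that case you have to create fixed kernels for different radii, and this is exactly why you see separate 3x3 and 5x5 convolution intrinsics: that is what they do. (See line 99 of rsCpuIntrinsicConvolve5x5.cpp for the 5x5 intrinsic.)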
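A fixed-radius kernel along these lines might look like the following sketch: a 5-tap horizontal blur (radius 2) with the weights baked in as constants. Again this is plain C, not Renderscript; the names `clamp_index` and `blur5_at` are hypothetical, and the binomial weights 1/16, 4/16, 6/16, 4/16, 1/16 are an assumption chosen as a common stand-in for a small Gaussian, not the weights any intrinsic uses.

```c
#include <stddef.h>

/* Clamp a signed index into [0, n - 1] so taps past the edge reuse the
   border pixel, as the question's kernels do. */
static size_t clamp_index(long i, size_t n)
{
    if (i < 0) return 0;
    if ((size_t)i >= n) return n - 1;
    return (size_t)i;
}

/* 5-tap horizontal blur at position x of a row with n pixels.
   The radius is fixed at 2, so the taps are written out and the weights
   are compile-time constants (they sum to 1). */
float blur5_at(const float *row, size_t n, size_t x)
{
    const long c = (long)x;
    return row[clamp_index(c - 2, n)] * 0.0625f
         + row[clamp_index(c - 1, n)] * 0.25f
         + row[clamp_index(c,     n)] * 0.375f
         + row[clamp_index(c + 1, n)] * 0.25f
         + row[clamp_index(c + 2, n)] * 0.0625f;
}
```

The trade-off is the one described above: this function can only ever blur with radius 2, so each supported radius needs its own variant. The edge clamp still branches in this sketch; padding the input so that interior pixels never need clamping would be one way to remove it.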

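Also, the fact that you have two separate kernels does not help. If you are doing a Gaussian blur, the convolution kernel is indeed separable and you can do 1xN + Nx1 convolutions as you have done there, but I would recommend putting both passes in the same kernel.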
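To make "separable" concrete, the sketch below checks on one pixel that a horizontal 3-tap pass followed by a vertical 3-tap pass gives the same result as a full 3x3 convolution whose weights are the products w[j] * w[i]. It is plain C with hypothetical function names and only demonstrates the equivalence; it says nothing about how ScriptIntrinsicBlur is implemented.

```c
#include <stddef.h>

/* Full 2D convolution at interior pixel (x, y): the 3x3 weights are the
   outer product of the 1D weights w with themselves. */
float conv2d_full_3x3(const float *img, size_t width, size_t x, size_t y,
                      const float w[3])
{
    float sum = 0.0f;
    for (size_t j = 0; j < 3; ++j) {
        for (size_t i = 0; i < 3; ++i) {
            sum += img[(y - 1 + j) * width + (x - 1 + i)] * (w[j] * w[i]);
        }
    }
    return sum;
}

/* Separable form: blur each of the three rows horizontally (1xN),
   then blur the three row results vertically (Nx1). */
float conv2d_separable_3x3(const float *img, size_t width, size_t x, size_t y,
                           const float w[3])
{
    float rows[3];
    for (size_t j = 0; j < 3; ++j) {
        const float *p = img + (y - 1 + j) * width + (x - 1);
        rows[j] = p[0] * w[0] + p[1] * w[1] + p[2] * w[2];
    }
    return rows[0] * w[0] + rows[1] * w[1] + rows[2] * w[2];
}
```

The saving comes when the horizontal results are stored for the whole image, as the `temp` allocation in the question does: each output pixel then costs about 2N multiplies (N per pass) instead of N x N for the full 2D kernel.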

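Keep in mind, though, that even with these tricks you may still not get results as fast as the actual intrinsics, because those have probably been highly optimized for your specific device.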
