java - 最快的高斯模糊实现

Question

你如何实现最快的高斯模糊算法？

我打算用 Java 实现它，所以排除了GPU解决方案。我的应用程序planetGenesis是跨平台的，所以我不想要JNI。

score 33 · Accepted Answer

您应该使用高斯核是可分离的这一事实，即您可以将 2D 卷积表示为两个 1D 卷积的组合。

如果滤波器很大，使用空间域中的卷积等价于频域（傅立叶）域中的乘法这一事实也可能有意义。这意味着您可以对图像和滤波器进行傅里叶变换，将（复数）结果相乘，然后进行傅里叶逆变换。FFT（快速傅里叶变换）的复杂度为 O(n log n)，而卷积的复杂度为 O(n^2)。此外，如果您需要使用相同的过滤器模糊许多图像，您只需对过滤器进行一次 FFT。

如果您决定使用 FFT，FFTW 库是一个不错的选择。

score 28 · Accepted Answer

数学运动员可能知道这一点，但对其他人来说......

由于高斯具有良好的数学特性，您可以通过首先对图像的每一行运行 1D 高斯模糊，然后对每列运行 1D 模糊来快速模糊 2D 图像。

score 23 · Accepted Answer

终极解决方案

如此多的信息和实现让我非常困惑，我不知道我应该相信哪一个。想通之后，我决定写自己的文章。我希望它能为您节省数小时的时间。

最快的高斯模糊（线性时间）

它包含源代码，（我希望）它简短、干净且易于重写为任何其他语言。请投票给它，以便其他人可以看到它。

score 17 · Accepted Answer

我找到了Quasimondo：孵化器：处理：快速高斯模糊。此方法包含许多近似值，例如使用整数和查找表而不是浮点数和浮点除法。我不知道现代 Java 代码有多少加速。
Fast Shadows on Rectangles有一个使用B-splines的近似算法。
C# 中的快速高斯模糊算法声称有一些很酷的优化。
此外，David Everly 的Fast Gaussian Blur (PDF) 提供了一种用于高斯模糊处理的快速方法。

我会尝试各种方法，对它们进行基准测试并在此处发布结果。

出于我的目的，我从 Internet复制并实现了基本（独立处理 XY 轴）方法和 David Everly 的快速高斯模糊方法。它们的参数不同，所以我无法直接比较它们。然而，对于大的模糊半径，后者经历的迭代次数要少得多。此外，后者是一种近似算法。

score 8 · Accepted Answer

您可能希望框模糊，这要快得多。请参阅此链接以获取出色的教程和一些复制和粘贴 C 代码。

score 7 · Accepted Answer

For larger blur radiuses, try applying a box blur three times. This will approximate a Gaussian blur very well, and be much faster than a true Gaussian blur.

score 3 · Accepted Answer

我已经将 Ivan Kuckir 的快速高斯模糊实现转换为 Java，该实现使用三个通道和线性框模糊。正如他在自己的博客中所说，由此产生的过程是 O(n) 。如果您想了解更多关于为什么 3 时间框模糊近似于高斯模糊（3%）我的朋友，您可以查看框模糊和高斯模糊。

这是java实现。

@Override
public BufferedImage ProcessImage(BufferedImage image) {
    int width = image.getWidth();
    int height = image.getHeight();

    int[] pixels = image.getRGB(0, 0, width, height, null, 0, width);
    int[] changedPixels = new int[pixels.length];

    FastGaussianBlur(pixels, changedPixels, width, height, 12);

    BufferedImage newImage = new BufferedImage(width, height, image.getType());
    newImage.setRGB(0, 0, width, height, changedPixels, 0, width);

    return newImage;
}

private void FastGaussianBlur(int[] source, int[] output, int width, int height, int radius) {
    ArrayList<Integer> gaussianBoxes = CreateGausianBoxes(radius, 3);
    BoxBlur(source, output, width, height, (gaussianBoxes.get(0) - 1) / 2);
    BoxBlur(output, source, width, height, (gaussianBoxes.get(1) - 1) / 2);
    BoxBlur(source, output, width, height, (gaussianBoxes.get(2) - 1) / 2);
}

private ArrayList<Integer> CreateGausianBoxes(double sigma, int n) {
    double idealFilterWidth = Math.sqrt((12 * sigma * sigma / n) + 1);

    int filterWidth = (int) Math.floor(idealFilterWidth);

    if (filterWidth % 2 == 0) {
        filterWidth--;
    }

    int filterWidthU = filterWidth + 2;

    double mIdeal = (12 * sigma * sigma - n * filterWidth * filterWidth - 4 * n * filterWidth - 3 * n) / (-4 * filterWidth - 4);
    double m = Math.round(mIdeal);

    ArrayList<Integer> result = new ArrayList<>();

    for (int i = 0; i < n; i++) {
        result.add(i < m ? filterWidth : filterWidthU);
    }

    return result;
}

private void BoxBlur(int[] source, int[] output, int width, int height, int radius) {
    System.arraycopy(source, 0, output, 0, source.length);
    BoxBlurHorizantal(output, source, width, height, radius);
    BoxBlurVertical(source, output, width, height, radius);
}

private void BoxBlurHorizontal(int[] sourcePixels, int[] outputPixels, int width, int height, int radius) {
    int resultingColorPixel;
    float iarr = 1f / (radius + radius);
    for (int i = 0; i < height; i++) {
        int outputIndex = i * width;
        int li = outputIndex;
        int sourceIndex = outputIndex + radius;

        int fv = Byte.toUnsignedInt((byte) sourcePixels[outputIndex]);
        int lv = Byte.toUnsignedInt((byte) sourcePixels[outputIndex + width - 1]);
        float val = (radius) * fv;

        for (int j = 0; j < radius; j++) {
            val += Byte.toUnsignedInt((byte) (sourcePixels[outputIndex + j]));
        }

        for (int j = 0; j < radius; j++) {
            val += Byte.toUnsignedInt((byte) sourcePixels[sourceIndex++]) - fv;
            resultingColorPixel = Byte.toUnsignedInt(((Integer) Math.round(val * iarr)).byteValue());
            outputPixels[outputIndex++] = (0xFF << 24) | (resultingColorPixel << 16) | (resultingColorPixel << 8) | (resultingColorPixel);
        }

        for (int j = (radius + 1); j < (width - radius); j++) {
            val += Byte.toUnsignedInt((byte) sourcePixels[sourceIndex++]) - Byte.toUnsignedInt((byte) sourcePixels[li++]);
            resultingColorPixel = Byte.toUnsignedInt(((Integer) Math.round(val * iarr)).byteValue());
            outputPixels[outputIndex++] = (0xFF << 24) | (resultingColorPixel << 16) | (resultingColorPixel << 8) | (resultingColorPixel);
        }

        for (int j = (width - radius); j < width; j++) {
            val += lv - Byte.toUnsignedInt((byte) sourcePixels[li++]);
            resultingColorPixel = Byte.toUnsignedInt(((Integer) Math.round(val * iarr)).byteValue());
            outputPixels[outputIndex++] = (0xFF << 24) | (resultingColorPixel << 16) | (resultingColorPixel << 8) | (resultingColorPixel);
        }
    }
}

private void BoxBlurVertical(int[] sourcePixels, int[] outputPixels, int width, int height, int radius) {
    int resultingColorPixel;
    float iarr = 1f / (radius + radius + 1);
    for (int i = 0; i < width; i++) {
        int outputIndex = i;
        int li = outputIndex;
        int sourceIndex = outputIndex + radius * width;

        int fv = Byte.toUnsignedInt((byte) sourcePixels[outputIndex]);
        int lv = Byte.toUnsignedInt((byte) sourcePixels[outputIndex + width * (height - 1)]);
        float val = (radius + 1) * fv;

        for (int j = 0; j < radius; j++) {
            val += Byte.toUnsignedInt((byte) sourcePixels[outputIndex + j * width]);
        }
        for (int j = 0; j <= radius; j++) {
            val += Byte.toUnsignedInt((byte) sourcePixels[sourceIndex]) - fv;
            resultingColorPixel = Byte.toUnsignedInt(((Integer) Math.round(val * iarr)).byteValue());
            outputPixels[outputIndex] = (0xFF << 24) | (resultingColorPixel << 16) | (resultingColorPixel << 8) | (resultingColorPixel);
            sourceIndex += width;
            outputIndex += width;
        }
        for (int j = radius + 1; j < (height - radius); j++) {
            val += Byte.toUnsignedInt((byte) sourcePixels[sourceIndex]) - Byte.toUnsignedInt((byte) sourcePixels[li]);
            resultingColorPixel = Byte.toUnsignedInt(((Integer) Math.round(val * iarr)).byteValue());
            outputPixels[outputIndex] = (0xFF << 24) | (resultingColorPixel << 16) | (resultingColorPixel << 8) | (resultingColorPixel);
            li += width;
            sourceIndex += width;
            outputIndex += width;
        }
        for (int j = (height - radius); j < height; j++) {
            val += lv - Byte.toUnsignedInt((byte) sourcePixels[li]);
            resultingColorPixel = Byte.toUnsignedInt(((Integer) Math.round(val * iarr)).byteValue());
            outputPixels[outputIndex] = (0xFF << 24) | (resultingColorPixel << 16) | (resultingColorPixel << 8) | (resultingColorPixel);
            li += width;
            outputIndex += width;
        }
    }
}

score 2 · Accepted Answer

第 1 步：SIMD 1 维高斯模糊
第 2 步：转置
第 3 步：重复第 1 步

最好在小块上完成，因为全图像转置很慢，而使用PUNPCK链（PUNPCKHBW、PUNPCKHDQ、PUNPCKHWD、PUNPCKLBW、PUNPCKLDQ、PUNPCKLWD）可以非常快速地完成小块转置。

score 2 · Accepted Answer

我会考虑为此使用 CUDA 或其他一些 GPU 编程工具包，特别是如果您想使用更大的内核。如果做不到这一点，总会在装配中手动调整你的循环。

score 2 · Accepted Answer

我在研究中为这个问题苦苦挣扎，并尝试了一种有趣的快速高斯模糊方法。首先，如前所述，最好将模糊分成两个 1D 模糊，但根据您的硬件实际计算像素值，您实际上可以预先计算所有可能的值并将它们存储在查找表中。

换句话说，预先计算Gaussian coefficient*的每个组合input pixel value。当然，您需要离散化您的系数，但我只是想添加此解决方案。如果您有IEEE订阅，您可以在Fast image blurring using Lookup Table for real time feature extract中阅读更多内容。

最终，我最终还是使用了CUDA :)

score 2 · Accepted Answer

在一维中：

重复使用几乎任何内核进行模糊将倾向于高斯内核。这就是高斯分布的奇妙之处，也是统计学家喜欢它的原因。所以选择容易模糊的东西并多次应用它。

例如，使用盒形内核很容易模糊。首先计算一个累积和：

y(i) = y(i-1) + x(i)

然后：

blurred(i) = y(i+radius) - y(i-radius)

重复几次。

或者您可能会使用各种IIR滤波器来回切换几次，它们的速度同样快。

在 2D 或更高版本中：

正如DarenW所说，一个接一个地模糊每个维度。

score 1 · Accepted Answer

我在不同的地方看到了几个答案，并在这里收集它们，以便我可以尝试围绕它们并记住它们以供以后使用：

无论您使用哪种方法，都应使用 1D 过滤器分别过滤水平和垂直维度，而不是使用单个方形过滤器。

标准的“慢”方法：卷积滤波器
SIFT 中分辨率降低的分层图像金字塔
由中心极限定理推动的重复框模糊。Box Blur 是 Viola 和 Jones 人脸检测的核心，如果我没记错的话，他们称其为整体图像。我认为类似 Haar 的功能也使用了类似的东西。
Stack Blur：卷积和框模糊方法之间的基于队列的替代方案
IIR 滤波器
- Derich 滤波器（维基百科）二阶 IIR 滤波器
- van Vliet 过滤器我对这个一无所知
- 贝塞尔过滤器虽然有一些关于这些的争论

在回顾了所有这些之后，我想起了简单的、糟糕的近似在实践中通常效果很好。在另一个领域，Alex Krizhevsky 发现 ReLU 比他开创性的 AlexNet 中的经典 sigmoid 函数更快，尽管乍一看它们似乎是 Sigmoid 的可怕近似。

score 0 · Accepted Answer

二维数据的高斯模糊有几种快速方法。你应该知道什么。

这是可分离滤波器，因此只需要两个 1d 卷积。
对于大内核，您可以处理缩小的图像副本，而不是放大回来。
可以通过多个盒子过滤器（也是可分离的）来完成良好的近似，（可以调整迭代次数和内核大小）
现有 O(n) 复杂度算法（适用于任何内核大小），用于通过 IIR 滤波器进行精确的高斯近似。

您的选择取决于所需的速度、精度和实施复杂性。

score 0 · Accepted Answer

尝试按照我在这里的方式使用 Box Blur： Approximating Gaussian Blur Using Extended Box Blur

这是最好的近似值。

使用 Integral Images 可以让它更快。
如果你这样做，请分享你的解决方案。

score 0 · Accepted Answer

CWP 的 Dave Hale 有一个 minejtk 包，其中包括递归高斯滤波器（Deriche 方法和 Van Vliet 方法）。java子程序可以在https://github.com/dhale/jtk/blob/0350c23f91256181d415ea7369dbd62855ac4460/core/src/main/java/edu/mines/jtk/dsp/RecursiveGaussianFilter.java找到

Deriche 的方法对于高斯模糊（以及高斯的导数）似乎是一种非常好的方法。

score 0 · Accepted Answer

用现在（截至 2016 年）实施的新库来回答这个老问题，因为 Java 的 GPU 技术有许多新的进步。

正如其他几个答案所建议的那样，CUDA 是一种替代方案。但是java现在有CUDA支持。

IBM CUDA4J 库：提供用于管理和访问 GPU 设备、库、内核和内存的 Java API。使用这些新的 API，可以使用 Java 内存模型、异常和自动资源管理的便利性来编写管理 GPU 设备特性并将工作卸载到 GPU 的 Java 程序。

Jcuda：NVIDIA CUDA 和相关库的 Java 绑定。使用 JCuda，可以从 Java 程序与 CUDA 运行时和驱动程序 API 进行交互。

Aparapi：允许 Java 开发人员通过在 GPU 上执行数据并行代码片段来利用 GPU 和 APU 设备的计算能力，而不是局限于本地 CPU。

一些Java OpenCL 绑定库

https://github.com/ochafik/JavaCL：OpenCL的 Java 绑定：一个面向对象的 OpenCL 库，基于自动生成的低级绑定

http://jogamp.org/jocl/www/：OpenCL的 Java 绑定：一个面向对象的 OpenCL 库，基于自动生成的低级绑定

http://www.lwjgl.org/：OpenCL的 Java 绑定：自动生成的低级绑定和面向对象的便利类

http://jocl.org/：OpenCL的 Java 绑定：原始 OpenCL API 的 1:1 映射的低级绑定

上述所有这些库都将有助于在 CPU 上比 Java 中的任何实现更快地实现高斯模糊。

java - 最快的高斯模糊实现

16 回答 16

Related

Reference