c++ - How to pick up repeat image pairs (exactly same) among lots of lossless compressed images ? How to std::hash in memory?

Question

My problem is this: I have around 500 images, among which there may be one or two pairs of images that are completely identical, meaning the files' checksums are the same. My eventual goal is to find out which images form these repeated pairs.

However, I now have to apply a compression algorithm to these 500 images, because the uncompressed images occupy too much disk space. The compression breaks the checksum, so I cannot use the checksums of the compressed image files to find the repeated pairs.

Fortunately, my compression algorithm is lossless, which means the restored uncompressed images can still be hashed somehow. But I want to do this in memory, without much disk write access. So my question is: how do I efficiently pick out repeated images among a large number of image files, in memory?

I use OpenCV often, but any answer is good as long as it is efficient and does not save any file to disk. Python/Bash code is also acceptable; C/C++ with OpenCV is preferred.

I can think of using OpenCV's Mat with std::hash, but std::hash won't work directly: I would have to write a std::hash<cv::Mat> specialization myself, and I don't know how to do it properly yet.

Of course I can do this:

For each pair of images (img1, img2) among all my images:
    if ((cv::Mat)img1 == (cv::Mat)img2):
        print img1 and img2 are identical

But this is extremely inefficient: for n images it requires on the order of n^2 full-image comparisons.

Note that my problem is not an image similarity problem; it is a hashing problem in memory.

score 1 · Accepted Answer

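The idea behind an image hashing algorithm:

1. Reduce the size of the original image (cvResize()), so that only the significant objects remain in the picture (this gets rid of high frequencies). Shrink the image to 8x8, so that the total number of pixels is 64 and the hash suits all kinds of images, regardless of their size and aspect ratio.
2. Remove the color. Convert the image from the previous step to grayscale (cvCvtColor()). The hash is thus reduced from 192 values (64 values for each of the three channels: red, green, and blue) to 64 luminance values.
3. Find the average brightness of the resulting image (cvAvg()).
4. Binarize the image (cvThreshold()). Keep only those pixels that are greater than the average (treat them as 1 and all others as 0).
5. Build the hash. Pack the 64 ones and zeros of the picture into a single 64-bit hash value.

Next, if you need to compare two images, just build a hash for each of them and count the number of differing bits (using the Hamming distance). The Hamming distance is the number of positions at which two binary words of the same length differ.

A distance of zero means it is most likely the same image, while other values indicate how much the images differ.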
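The last three steps (average, threshold, pack into 64 bits) and the Hamming distance can be sketched in plain C++ as below. This is only a rough illustration, not the answerer's code: it assumes the image has already been resized to 8x8 and converted to grayscale (for example with OpenCV), and the names averageHash and hammingDistance are made up for this example. Since this kind of hash is designed to tolerate small differences, a distance of zero would presumably still need a full pixel comparison to confirm an exact duplicate.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Build a 64-bit average hash from an 8x8 grayscale thumbnail.
// Assumption: `gray` holds exactly 64 luminance values; resizing and
// grayscale conversion were done by the caller.
std::uint64_t averageHash(const std::vector<std::uint8_t>& gray) {
    assert(gray.size() == 64);
    // Mean brightness of the 64 pixels.
    const double mean = std::accumulate(gray.begin(), gray.end(), 0.0) / 64.0;
    std::uint64_t hash = 0;
    for (std::size_t i = 0; i < gray.size(); ++i) {
        // Pixels brighter than the mean contribute a 1 bit, all others a 0 bit.
        if (gray[i] > mean) {
            hash |= (std::uint64_t{1} << i);
        }
    }
    return hash;
}

// Number of bit positions in which the two hashes differ.
int hammingDistance(std::uint64_t a, std::uint64_t b) {
    return static_cast<int>(std::bitset<64>(a ^ b).count());
}
```

Two thumbnails with the same pattern of bright and dark pixels give a distance of 0; changing one dark pixel to a bright one flips a single bit as long as the mean does not move across any other pixel value.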

score 0 · Accepted Answer

OK, I came up with a solution myself; better solutions are welcome. I am pasting the code here.

#include "opencv2/core/core.hpp"
#include "opencv2/imgproc/imgproc.hpp"
#include "opencv2/highgui/highgui.hpp"
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <string>
#include <cstring>
#include <functional>
#include <openssl/md5.h>

using namespace std;
using namespace cv;

static void help()
{
}

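// Compute the MD5 digest of a byte buffer and return it as a 32-character
// lowercase hex string. `length` is an int, so buffers of 2 GiB or more
// cannot be passed in.
std::string str2md5(const char *str, int length) {
    MD5_CTX c;
    unsigned char digest[MD5_DIGEST_LENGTH];   // 16 bytes

    MD5_Init(&c);

    // MD5_Update accepts a buffer of any size, so no manual chunking is needed.
    if (length > 0) {
        MD5_Update(&c, str, (size_t)length);
    }

    MD5_Final(digest, &c);

    // Two hex digits per digest byte, plus the terminating NUL.
    char out[2 * MD5_DIGEST_LENGTH + 1];
    for (int n = 0; n < MD5_DIGEST_LENGTH; ++n) {
        snprintf(&out[n * 2], 3, "%02x", (unsigned int)digest[n]);
    }

    return std::string(out);
}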


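int main(int argc, const char** argv)
{
    help();

    // Expect exactly one argument: the path of the image file.
    if (argc != 2)
    {
        return EXIT_FAILURE;
    }

    string inputfile = argv[1];

    // -1 loads the image unchanged (original bit depth and channels).
    Mat src = imread(inputfile, -1);

    if (src.empty())
    {
        return EXIT_FAILURE;
    }

    // Hash the decoded pixel buffer: step[0] is the number of bytes per row.
    cout << str2md5((char*)src.data, (int)(src.step[0] * src.rows)) << " " << inputfile << endl;

    return 0;
}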

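You must have OpenSSL (libssl-dev) installed on your machine to compile this code. It loads the image into memory and computes its md5 value. So, to find the duplicated image pairs, just write a simple bash/python script that searches the array of md5 values that the compiled program produces for the files. Note that this md5 check code does not work for large image files.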
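As an alternative to an external script, the search for repeated digests could also be done inside one process. The following is a minimal sketch under several assumptions: each decoded image is available as a plain byte buffer (for example copied out of a cv::Mat), std::hash over a std::string_view (C++17) stands in for MD5, and the names Buffer, hashBytes and findDuplicates are hypothetical. Because std::hash is not collision-free, candidates that share a hash are confirmed with a full byte comparison.

```cpp
#include <cstddef>
#include <functional>
#include <string_view>
#include <unordered_map>
#include <utility>
#include <vector>

// Assumption: one decoded image is represented by its raw pixel bytes.
using Buffer = std::vector<unsigned char>;

// Hash the raw bytes of one decoded image.
std::size_t hashBytes(const Buffer& buf) {
    std::string_view view(reinterpret_cast<const char*>(buf.data()), buf.size());
    return std::hash<std::string_view>{}(view);
}

// Return index pairs (i, j) with i < j whose buffers are byte-for-byte identical.
std::vector<std::pair<std::size_t, std::size_t>>
findDuplicates(const std::vector<Buffer>& images) {
    // Bucket image indices by hash so only same-hash images are compared.
    std::unordered_map<std::size_t, std::vector<std::size_t>> buckets;
    std::vector<std::pair<std::size_t, std::size_t>> pairs;
    for (std::size_t j = 0; j < images.size(); ++j) {
        std::vector<std::size_t>& bucket = buckets[hashBytes(images[j])];
        for (std::size_t i : bucket) {
            // Equal hashes are only a hint; confirm with a full comparison.
            if (images[i] == images[j]) {
                pairs.emplace_back(i, j);
            }
        }
        bucket.push_back(j);
    }
    return pairs;
}
```

With this layout each image is hashed once, and full comparisons happen only between images that land in the same bucket, so for 500 mostly distinct images very few full comparisons should be needed.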

score 0 · Accepted Answer

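If what you want is exact copies of an image, you could start by comparing pixel 1,1 of all images and grouping them by identical values at pixel 1,1. After that you know the groups (hopefully many of them?), and then you compare pixel 1,2 within each group. In this way you proceed pixel by pixel until you have a hundred or so groups. Then you compare the images completely within each group. This way you still use the slow pairwise comparison, but each time on a group of about five images instead of on all 500 at once. I assume you can read your images pixel by pixel; I know this is possible if they are in .fits, using the pyfits module, but I guess an alternative exists for almost any image format?

So the idea behind this is that if pixel 1,1 differs, the whole image differs. In this way you can build a list from the values of the first three pixels or so. If there is enough variability in that list, you can do the one-to-one full-image checks on much smaller groups of images instead of on 500 images at once. Does this sound like it should do what you want?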
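The grouping idea above might look roughly like this in C++. It is only a sketch under assumptions: images are held as raw byte buffers, the "first few pixels" are taken to be the first prefixLen bytes of each buffer, and the names Image, groupByPrefix and duplicatesWithinGroups are invented for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Assumption: one decoded image is represented by its raw pixel bytes.
using Image = std::vector<unsigned char>;

// Group image indices by their first `prefixLen` bytes (the "first few pixels").
std::map<Image, std::vector<std::size_t>>
groupByPrefix(const std::vector<Image>& images, std::size_t prefixLen) {
    std::map<Image, std::vector<std::size_t>> groups;
    for (std::size_t i = 0; i < images.size(); ++i) {
        // Never read past the end of a short buffer.
        const std::size_t n = std::min(prefixLen, images[i].size());
        Image key(images[i].begin(),
                  images[i].begin() + static_cast<std::ptrdiff_t>(n));
        groups[key].push_back(i);
    }
    return groups;
}

// Full pairwise comparison, but only between members of the same group.
std::vector<std::pair<std::size_t, std::size_t>>
duplicatesWithinGroups(const std::vector<Image>& images, std::size_t prefixLen) {
    std::vector<std::pair<std::size_t, std::size_t>> pairs;
    const auto groups = groupByPrefix(images, prefixLen);
    for (const auto& entry : groups) {
        const std::vector<std::size_t>& members = entry.second;
        for (std::size_t a = 0; a < members.size(); ++a) {
            for (std::size_t b = a + 1; b < members.size(); ++b) {
                if (images[members[a]] == images[members[b]]) {
                    pairs.emplace_back(members[a], members[b]);
                }
            }
        }
    }
    return pairs;
}
```

How well this works depends on how much the leading pixels vary: if many images share the same border or background color, most of them fall into one group and the saving is small.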
