python - How to create and save pairwise Hamming distances of perceptual image hashes to feed a clustering algorithm

Question

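I'm hoping someone can offer guidance on how to compute pairwise Hamming distances for a set of hashes and then cluster them. I'm not too concerned about performance: given what I'm doing and what I want to achieve, it will be slow no matter what, and it isn't something that will be run over and over.

So... in short, I mistakenly deleted thousands of photos from a drive and had no backups (I know... bad practice). Using various tools I was able to recover a very high percentage of them from the drive, but was left with hundreds of thousands of photos. Because of the techniques used to recover some of the photos (such as file carving), some images are corrupt to varying degrees, others are identical copies, and others are essentially identical visually but differ byte for byte.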

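What I'm looking to do to help with this situation is the following:

Check each image and identify whether the image file is structurally corrupt (done)
Generate a perceptual hash (fingerprint) for each image so that images can be compared for similarity and clustered (fingerprinting part is done)
Compute the pairwise distances of the fingerprints
Cluster the pairwise distances so that similar images can be viewed together to aid manual cleanup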

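In the attached script you'll notice a few places where I calculate hashes; I'll explain them so as not to cause confusion...

For images supported by PIL, I generate three hashes: one for the original image, one rotated 90 degrees, and one rotated 180 degrees. This is done so that when the pairwise computations are performed, I can account for images that differ only in orientation.
For raw images not supported by PIL, I instead prefer hashes generated from the extracted embedded preview image. I do this rather than using the raw image because, if the raw image file is corrupt, the preview image is quite likely to be intact thanks to its smaller size, and so is a better basis for judging whether the image is similar to others.
The other place hashes are generated is during a last-ditch effort to identify corrupt raw images. I compare the hash of the extracted/converted raw image with the hash of the extracted embedded preview image, and if the similarity doesn't meet a defined threshold, I assume the raw file as a whole is probably corrupt.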

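What I need guidance on is how to accomplish the following:

Take the three hashes for each image and compute pairwise Hamming distances
For each image comparison, keep only the most similar Hamming distance
Feed the results into scipy hierarchical clustering so that I can group similar images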
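The first two tasks in that list can be sketched in plain Python. This is only an illustration under some assumptions: the hashes are stored as equal-length hex strings (as `str()` of an imagehash hash would give), and the helper names `hamming_bits` and `best_distance` plus the short 16-bit sample hashes are made up for the example. Note that it counts differing bits, whereas comparing the strings character by character counts differing hex digits.

```python
def hamming_bits(hex_a, hex_b):
    """Count the differing bits between two equal-length hex-encoded hashes."""
    if len(hex_a) != len(hex_b):
        raise ValueError("hashes must have the same length")
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


def best_distance(hashes_a, hashes_b):
    """Smallest normalized Hamming distance over every pairing of two hash lists."""
    n_bits = 4 * len(hashes_a[0])  # each hex digit encodes 4 bits
    return min(hamming_bits(a, b) for a in hashes_a for b in hashes_b) / n_bits


# Hypothetical 16-bit hashes (original, rotated 90, rotated 180) for two images.
image_1 = ["ffff", "0f0f", "00ff"]
image_2 = ["f0f0", "fffe", "3333"]

# The closest pairing is "ffff" vs "fffe", which differ in a single bit.
print(best_distance(image_1, image_2))
```

Keeping the minimum over all rotation pairings means two copies of the same photo stored in different orientations still end up with a small distance.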

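I'm just learning Python, so that's part of my challenge... From what I've gathered from Google, I think I can do this by first getting the pairwise distances with scipy.spatial.distance.pdist, then processing the result to keep the most similar distance for each image comparison, and then feeding that into a scipy clustering function. But I can't figure out how to organize this and provide things in the right format, etc. Can anyone offer some guidance?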
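One possible way to organize this, offered as a rough sketch rather than a definitive answer: since `pdist` expects one vector per observation and each image here carries several hashes, it may be simpler to fill a square matrix by hand and let `squareform` convert it to the condensed form that scipy's `linkage` accepts. The helper names and the tiny 16-bit hashes below are assumptions for illustration only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform


def hamming_fraction(hex_a, hex_b):
    """Fraction of differing bits between two equal-length hex strings."""
    diff = int(hex_a, 16) ^ int(hex_b, 16)
    return bin(diff).count("1") / (4 * len(hex_a))


def distance_matrix(hash_lists):
    """Symmetric matrix holding the best (smallest) distance for every image pair."""
    n = len(hash_lists)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            best = min(hamming_fraction(a, b)
                       for a in hash_lists[i] for b in hash_lists[j])
            dist[i, j] = dist[j, i] = best
    return dist


# Hypothetical hashes: one inner list per image, one entry per rotation.
hash_lists = [
    ["ffff", "0f0f"],  # image 0
    ["fffe", "0f0e"],  # image 1, nearly identical to image 0
    ["0000", "1111"],  # image 2, unrelated
]

dist = distance_matrix(hash_lists)
condensed = squareform(dist)             # square matrix -> condensed vector
Z = linkage(condensed, method="single")  # hierarchical clustering
print(Z)
```

Each row of `Z` describes one merge: the two cluster indices, the distance at which they were joined, and the size of the new cluster.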

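Here is my current script for reference, in case anyone finds it interesting; I will need to change it to store the hashes in some kind of dictionary or on-disk storage.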

from PIL import Image
from PIL import ImageFile
import os, sys, imagehash, pyexiv2, rawpy, re
from tempfile import NamedTemporaryFile
from subprocess import check_call, call

# allow PIL to load truncated images (so that perceptual hashes can be created for truncated/damaged images still)
ImageFile.LOAD_TRUNCATED_IMAGES = True

# image files this script will handle
# PIL supported image formats
stdimageext = ('.jpg','.jpeg', '.bmp', '.png', '.gif', '.tif', '.tiff')
# libraw/ufraw supported formats
rawimageext = ('.nef', '.dng', '.tif', '.tiff')

devnull = open(os.devnull, 'w')

corruptRegex = re.compile(r'_\[.+\]\..{3,4}$')
for root, dirs, files in os.walk(sys.argv[1]):
    for filename in files:
        ext = os.path.splitext(filename.lower())[1]
        filepath = os.path.join(root, filename)
        if ext in (stdimageext + rawimageext):
            hashes = [None] * 4
            print(filename)
            # reset corrupt string
            corrupt_str = None
            if ext in (stdimageext):
                metadata = pyexiv2.ImageMetadata(filepath)
                metadata.read()
                rotate = 0
                try:
                    im = Image.open(filepath)
                except:
                    None
                else:
                    # hash the original orientation plus the 90 and 180 degree rotations
                    for x in range(3):
                        hashes[x] = imagehash.dhash(im.rotate(90 * x),32)

                # use jpeginfo against all jpg images as its pretty accurate
                if ext in ('.jpg','.jpeg'):
                    rc = 0
                    rc = call(["jpeginfo", "--check", filepath], stdout=devnull, stderr=devnull)
                    if rc == 1:
                        corrupt_str = 'JpegInfo'

                if corrupt_str is None:
                    try:
                        im = Image.open(filepath)
                        im.verify()
                    except:
                        e = sys.exc_info()[0]
                        corrupt_str = 'PIL_Verify'
                    else:
                        try:
                            im = Image.open(filepath)
                            im.load()
                        except:
                            e =  sys.exc_info()[0]
                            corrupt_str = 'PIL_Load'

            # raw image processing
            else:
                # extract largest embedded preview image first
                metadata_orig = pyexiv2.ImageMetadata(filepath)
                metadata_orig.read()
                if len(metadata_orig.previews) > 0:
                    preview = metadata_orig.previews[-1]

                    # save preview to temp file
                    temp_preview = NamedTemporaryFile()
                    preview.write_to_file(temp_preview.name)
                    os.rename(temp_preview.name + preview.extension, temp_preview.name)

                    rotate = 0
                    try:
                        im = Image.open(temp_preview.name)
                    except:
                        None
                    else:
                        for x in range(4):
                            hashes[x] = imagehash.dhash(im.rotate(90 * (x + 1)),32)
                    # close temp file
                    temp_preview.close()

                # try to load raw using libraw via rawpy first, 
                # generally if libraw can't load it then ufraw extraction would also fail
                try:
                    with rawpy.imread(filepath) as im:
                        None
                except:
                    e = sys.exc_info()[0]
                    corrupt_str = 'Libraw_Load'

                else:
                    # as a final last ditch effort compare perceptual hashes of extracted 
                    # raw and embedded preview to detect possible internal corruption 

                    if len(metadata_orig.previews) > 0:
                        # extract and convert raw to jpeg image using ufraw
                        temp_raw = NamedTemporaryFile(suffix='.jpg')

                        try:
                            check_call(['ufraw-batch', '--wb=camera', '--rotate=camera', '--out-type=jpg', '--compression=95', '--noexif', '--lensfun=none', '--output=' + temp_raw.name, '--overwrite', '--silent', filepath],stdout=devnull, stderr=devnull)

                        except:
                            e = sys.exc_info()[0]
                            corrupt_str = 'Ufraw-conv'

                        else:
                            rhash = imagehash.dhash(Image.open(temp_raw.name),32)

                            # compare preview with raw image and compute the most similar hamming distance (best)
                            hamdiff = .0
                            for h in range(4):
                                # calculate hamming distance to compare similarity
                                hamdiff = max((256 - sum(bool(ord(ch1) - ord(ch2)) for ch1, ch2 in zip(str(hashes[h]), str(rhash))))/256,hamdiff)

                            if hamdiff < .7: # raw file is probably corrupt
                                corrupt_str = 'hash' + str(round(hamdiff*100,2))
                            # debug output; kept in the else branch so hamdiff and rhash are always defined here
                            print(hamdiff)
                            print(rhash)
                        # close temp files
                        temp_raw.close()

            print(hashes[0])
            print(hashes[1])
            print(hashes[2])
            print(hashes[3])

            # prefix file if corruption was detected ensuring that existing files already prefixed are re prefixed
            mo = corruptRegex.search(filename)
            if corrupt_str is not None:
                if mo is not None:
                    os.rename(filepath,os.path.join(root, re.sub(corruptRegex, '_[' + corrupt_str + ']', filename) + ext))
                else:
                    os.rename(filepath,os.path.join(root, os.path.splitext(filename)[0] + '_[' + corrupt_str + ']' + ext))
            else:
                if mo is not None:
                    os.rename(filepath,os.path.join(root, re.sub(corruptRegex, '', filename) + ext))

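Edited: I just wanted to provide an update with what I finally came up with, which seems to work very well for my intended purpose; maybe it will be useful to other users in a similar situation. The script could still use some polish, but otherwise all the meat is there. As I'm still learning Python, if anyone sees something that could be greatly improved, please let me know.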

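The script does the following:

Attempts to detect image corruption in terms of file structure using various methods. For raw image formats (NEF, DNG, TIF), I found that a corrupt image would sometimes still load fine, so I decided to hash both the preview image and an extracted .jpg of the raw image and compare the hashes; if they are not similar enough, I assume the image is corrupt in some form.
Creates perceptual hashes for each image that can be loaded. Three are created for the base file (original, original rotated 90, original rotated 180). In addition, for raw images, 3 extra hashes are created for the extracted preview image; this is done so that if the raw image is corrupt, we still have hashes based on an intact image (assuming the preview is fine).
Images identified as corrupt are renamed with a suffix indicating that they are corrupt and what determined it.
Pairwise Hamming distances are computed by comparing the hashes for every pair of files and are stored in a numpy array.
The pairwise distance matrix is converted with squareform and fed to fastcluster for clustering
The output from fastcluster is used to generate a dendrogram to visualize clusters of similar images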

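I save the numpy array to disk so that I can later rerun the fastcluster/dendrogram portion without recomputing the hashes for every file, which is slow. This is something I still have to change the script to allow...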
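For the reload step, one option (a sketch, not what the script below does) is to store the file list next to the distance matrix with `np.savez`, so the dendrogram labels survive as well. The archive name `distances.npz`, the array keys, and the sample values are all assumptions for illustration.

```python
import os
import tempfile

import numpy as np

# Hypothetical results from a previous hashing run.
dist = np.array([[0.0, 0.1, 0.6],
                 [0.1, 0.0, 0.5],
                 [0.6, 0.5, 0.0]])
filelist = ["a.jpg", "b.jpg", "c.nef"]

# Save the matrix and the matching file names together in one archive.
path = os.path.join(tempfile.mkdtemp(), "distances.npz")
np.savez(path, dist=dist, files=np.array(filelist))

# Later run: reload both without recomputing any hashes.
with np.load(path) as data:
    loaded_dist = data["dist"]
    loaded_files = data["files"].tolist()

print(loaded_files)
print(loaded_dist)
```

A later run could then start from `loaded_dist` and `loaded_files` and go straight to the clustering and dendrogram part.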

from PIL import Image
from PIL import ImageFile
import os, sys, imagehash, pyexiv2, rawpy, re
from tempfile import NamedTemporaryFile
from subprocess import check_call, call
import numpy as np
from scipy.cluster.hierarchy import dendrogram
from scipy.spatial.distance import squareform
import fastcluster
import matplotlib.pyplot as plt

# allow PIL to load truncated images (so that perceptual hashes can be created for truncated/damaged images still)
ImageFile.LOAD_TRUNCATED_IMAGES = True

# image files this script will handle
# PIL supported image formats
stdimageext = ('.jpg','.jpeg', '.bmp', '.png', '.gif', '.tif', '.tiff')
# libraw/ufraw supported formats
rawimageext = ('.nef', '.dng', '.tif', '.tiff')

devnull = open(os.devnull, 'w')

corruptRegex = re.compile(r'_\[.+\]\..{3,4}$')

hashes = []
filelist = []

for root, _, files in os.walk(sys.argv[1]):
    for filename in files:
        ext = os.path.splitext(filename.lower())[1]
        relpath = os.path.relpath(root, sys.argv[1])
        filepath = os.path.join(root, filename)
        if ext in (stdimageext + rawimageext):
            hashes_tmp = []
            rhash = []
            # reset corrupt string
            corrupt_str = None
            if ext in (stdimageext):
                try:
                    im=Image.open(filepath)
                    for x in range(3):
                        hashes_tmp.append(str(imagehash.dhash(im.rotate(90 * x, expand=1),32)))
                except:
                    None

                # use jpeginfo against all jpg images as its pretty accurate
                if ext in ('.jpg','.jpeg'):
                    rc = 0
                    rc = call(["jpeginfo", "--check", filepath], stdout=devnull, stderr=devnull)
                    if rc == 1:
                        corrupt_str = 'JpegInfo'

                if corrupt_str is None:
                    try:
                        im = Image.open(filepath)
                        im.verify()
                    except:
                        e = sys.exc_info()[0]
                        corrupt_str = 'PIL_Verify'
                    else:
                        try:
                            im = Image.open(filepath)
                            im.load()
                        except:
                            e =  sys.exc_info()[0]
                            corrupt_str = 'PIL_Load'

            # raw image processing
            if ext in (rawimageext):
                # extract largest embedded preview image first
                metadata_orig = pyexiv2.ImageMetadata(filepath)
                metadata_orig.read()
                if len(metadata_orig.previews) > 0:
                    preview = metadata_orig.previews[-1]

                    # save preview to temp file
                    temp_preview = NamedTemporaryFile()
                    preview.write_to_file(temp_preview.name)
                    os.rename(temp_preview.name + preview.extension, temp_preview.name)

                    try:
                        im = Image.open(temp_preview.name)
                        for x in range(3):
                            hashes_tmp.append(str(imagehash.dhash(im.rotate(90 * x,expand=1),32)))
                    except:
                        None


                # try to load raw using libraw via rawpy first, 
                # generally if libraw can't load it then ufraw extraction would also fail
                try:
                    im = rawpy.imread(filepath)
                except:
                    e = sys.exc_info()[0]
                    corrupt_str = 'Libraw_Load'

                else:
                    # as a final last ditch effort compare perceptual hashes of extracted 
                    # raw and embedded preview to detect possible internal corruption 

                    # extract and convert raw to jpeg image using ufraw
                    temp_raw = NamedTemporaryFile(suffix='.jpg')

                    try:
                        check_call(['ufraw-batch', '--wb=camera', '--rotate=camera', '--out-type=jpg', '--compression=95', '--noexif', '--lensfun=none', '--output=' + temp_raw.name, '--overwrite', '--silent', filepath],stdout=devnull, stderr=devnull)

                    except:
                        e = sys.exc_info()[0]
                        corrupt_str = 'Ufraw-conv'

                    else:
                        try:
                            im = Image.open(temp_raw.name)
                            for x in range(3):
                                rhash.append(str(imagehash.dhash(im.rotate(90 * x,expand=1),32)))
                        except:
                            None

                # compare preview with raw image and compute the most similar hamming distance (best)
                if len(hashes_tmp) > 0 and len(rhash) > 0:
                    hamdiff = 1
                    for rh in rhash:
                        # calculate hamming distance to compare similarity
                        hamdiff = min(hamdiff,(sum(bool(ord(ch1) - ord(ch2)) for ch1, ch2 in zip(hashes_tmp[0], rh))/len(hashes_tmp[0])))

                    # test the threshold only once the best (smallest) distance over all rotations is known
                    if hamdiff > .3: # raw file is probably corrupt
                        corrupt_str = 'hash' + str(round(hamdiff*100,2))

                hashes_tmp = hashes_tmp + rhash

            # prefix file if corruption was detected ensuring that existing files already prefixed are re prefixed
            mo = corruptRegex.search(filename)
            newfilename = None
            if corrupt_str is not None:
                if mo is not None:
                    newfilename = re.sub(corruptRegex, '_[' + corrupt_str + ']', filename) + ext
                else:
                    newfilename = os.path.splitext(filename)[0] + '_[' + corrupt_str + ']' + ext
            else:
                if mo is not None:
                    newfilename = re.sub(corruptRegex, '', filename) + ext

            if newfilename is not None:
                os.rename(filepath,os.path.join(root, newfilename))

            if len(hashes_tmp) > 0:
                hashes.append(hashes_tmp)
                if newfilename is not None:
                    filelist.append(os.path.join(relpath, newfilename))
                else:
                    filelist.append(os.path.join(relpath, filename))

print(len(filelist))
print(len(hashes))

a = np.empty(shape=(len(filelist),len(filelist)))

for hash_idx1, hash in enumerate(hashes):
    a[hash_idx1,hash_idx1] = 0
    hash_idx2 = hash_idx1 + 1
    while hash_idx2 < len(hashes):
        ham_dist = 1
        for h1 in hash:
            for h2 in hashes[hash_idx2]:
                ham_dist = min(ham_dist, (sum(bool(ord(ch1) - ord(ch2)) for ch1, ch2 in zip(h1, h2)))/len(h1))
        a[hash_idx1,hash_idx2] = ham_dist
        a[hash_idx2,hash_idx1] = ham_dist
        hash_idx2 = hash_idx2 + 1

print(a)

X = squareform(a)
print(X)

linkage = fastcluster.single(X)
clustdict = {i:[i] for i in range(len(linkage)+1)}
fig = plt.figure(figsize=(25,25))
plt.title('test title')
plt.xlabel('perceptual hash hamming distance')

plt.axvline(x=.15,c='red',linestyle='--')
dg = dendrogram(linkage, labels=filelist, orientation='right', show_leaf_counts=True)
ax = fig.gca()
ax.set_xlim(-.01,ax.get_xlim()[1])
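# save before showing: plt.show() has to be called, and the figure may be empty once its window is closed
plt.savefig('foo1.pdf', bbox_inches='tight', dpi=100)
plt.show()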

with open('numpyarray.npy','wb') as f:
    np.save(f,a)
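The red line drawn at 0.15 on the dendrogram can also be applied programmatically to get flat groups of file names. The following is a sketch under assumptions: it uses scipy's own `linkage` with `method="single"` as a stand-in for `fastcluster.single`, and the distance matrix and file names are invented for the example.

```python
from collections import defaultdict

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical file list and symmetric distance matrix with a zero diagonal.
filelist = ["a.jpg", "a_copy.jpg", "b.jpg", "c.nef"]
a = np.array([
    [0.00, 0.02, 0.40, 0.55],
    [0.02, 0.00, 0.41, 0.50],
    [0.40, 0.41, 0.00, 0.10],
    [0.55, 0.50, 0.10, 0.00],
])

Z = linkage(squareform(a), method="single")

# Cut the tree at distance 0.15: leaves joined below that height share a label.
labels = fcluster(Z, t=0.15, criterion="distance")

groups = defaultdict(list)
for name, label in zip(filelist, labels):
    groups[int(label)].append(name)

clusters = sorted(groups.values())
for members in clusters:
    print(members)
```

Each printed list is a set of files whose hashes chain together within the chosen threshold, which may be easier to act on than reading a very large dendrogram.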

score 0 · Accepted Answer

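It took some time... but I eventually figured it out and ended up with a script that does a good job of identifying whether an image is corrupt and then uses perceptual hashes to try to group similar images together.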

from PIL import Image, ImageFile
import os, sys, imagehash, pyexiv2, rawpy, re
from tempfile import NamedTemporaryFile
from subprocess import Popen, PIPE
import shlex
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster
from scipy.spatial.distance import squareform
import fastcluster
#import matplotlib.pyplot as plt
import math
import string
from wand.image import Image as wImage
import wand.exceptions
from io import BytesIO
from datetime import datetime
#import fd_table_status

def redirect_stdout():
    print("Redirecting stdout and stderr")
    sys.stdout.flush() # <--- important when redirecting to files
    sys.stderr.flush()
    newstdout = os.dup(1)
    newstderr = os.dup(2)
    devnull = os.open(os.devnull, os.O_WRONLY)
    devnull2 = os.open(os.devnull, os.O_WRONLY)
    os.dup2(devnull, 1)
    os.dup2(devnull2,2)
    os.close(devnull)
    os.close(devnull2)
    sys.stdout = os.fdopen(newstdout, 'w')
    sys.stderr = os.fdopen(newstderr, 'w')

redirect_stdout()

def ct(linkage_matrix,flist,score):
    cluster_id = []
    for fidx, file_ in enumerate(flist):
        link_ = np.where(linkage_matrix[:,:2] == fidx)[0]
        if len(link_) == 1:
            link = link_[0]
            if linkage_matrix[link][2] <= score:
                fcluster_idx = str(link).zfill(len(str(len(linkage_matrix))))
                while True:
                    match = np.where(linkage_matrix[:,:2] == link+1+len(linkage_matrix))[0]
                    if len(match) == 1:
                        link = match[0]
                        link_d = linkage_matrix[link]
                        if link_d[2] <= score:
                            fcluster_idx = str(match[0]).zfill(len(str(len(linkage_matrix)))) + fcluster_idx
                        else:
                            break
                    else:
                        break
            else:
                fcluster_idx = None

            cluster_id.append(fcluster_idx)

    return cluster_id

def get_exitcode_stdout_stderr(cmd):
    """
    Execute the external command and get its exitcode, stdout and stderr.
    """
    args = shlex.split(cmd)

    proc = Popen(args, stdout=PIPE, stderr=PIPE, close_fds=True)
    out, err = proc.communicate()
    exitcode = proc.returncode

    del proc

    return exitcode, out, err

if os.path.isdir(sys.argv[1]):
    start_time = datetime.now()
    # allow PIL to load truncated images (so that perceptual hashes can be created for truncated/damaged images still)
    ImageFile.LOAD_TRUNCATED_IMAGES = True

    # image files this script will handle
    # PIL supported image formats
    stdimageext = ('.jpg','.jpeg', '.bmp', '.png', '.gif', '.tif', '.tiff')
    # libraw/ufraw supported formats
    rawimageext = ('.nef', '.dng', '.tif', '.tiff')

    corruptRegex = re.compile(r'_\[.+\]\..{3,4}$')
    groupRegex = re.compile(r'^\[\d+\]_')
    ufrawRegex = re.compile(r'Corrupt data near|Unexpected end of file|has the wrong dimensions!|Cannot open file|Cannot decode file|requests a nonexistent image!')

    for subdirs,dirs,files in os.walk(sys.argv[1]):
        files.clear()
        dirs.clear()
        for root,_,files in os.walk(subdirs):
            print('\n******** Processing files in ' + root)
            hashes = []
            w_hash = []
            w_hash_idx = []
            filelist = []
            files_ = []
            cnt = 0
            for f in files:
                #cnt = cnt + 1
                #if cnt < 10:
                files_.append(f)
                continue
            cnt = 0

            for f_idx, fname in enumerate(files_):
                e=None
                ext = os.path.splitext(fname.lower())[1]
                filepath = os.path.join(root, fname)

                imformat = ''
                hashes_tmp = []

                # reset corrupt string
                corrupt_str = None

                if ext in (stdimageext + rawimageext):
                    print(str(int(round(((f_idx+1)/len(files_))*100))) + '%' + ' : ' + fname + '....', end='', flush=True)
                    try:
                        with wImage(filename=filepath) as im:
                            imformat = '.' + im.format.lower()
                            ext = imformat if imformat != '' else ext
                            with im.convert('jpeg') as converted:
                                jpeg_bin = converted.make_blob()
                                with Image.open(BytesIO(jpeg_bin)) as im2:
                                    hash_image = []
                                    for x in range(3):
                                        print('.',end='',flush=True)
                                        hash_i = str(imagehash.dhash(im2.rotate(90 * x, expand=1),32))
                                        if ''.join(set(hash_i)) != '0':
                                            hash_image.append(hash_i)
                                    if hash_image:
                                        hash_image.append(1)
                                        hashes_tmp.append(hash_image)
                    except:
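                        e = sys.exc_info()[0]
                        # look up the ImageMagick error code; non-wand exceptions have no entry in TYPE_MAP
                        errcode = next((str(k).zfill(3) for k, v in wand.exceptions.TYPE_MAP.items() if v == e), None)
                        if errcode is not None and int(errcode[-2:]) in (15,25,30,35,40,50,55):
                            corrupt_str = 'magick'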
                    finally:
                        try:
                            im.close()
                        except:
                            pass
                        try:
                            im2.close()
                        except:
                            pass

                    if ext in (stdimageext):
                        try:
                            with Image.open(filepath) as im:
                                hash_image = []
                                for x in range(3):
                                    print('.',end='',flush=True)
                                    hash_i = str(imagehash.dhash(im.rotate(90 * x, expand=1),32))
                                    if ''.join(set(hash_i)) != '0':
                                        hash_image.append(hash_i)
                                if hash_image:
                                    hash_image.append(2)
                                    hashes_tmp.append(hash_image)
                        except:
                            pass
                        finally:
                            try:
                                im.close()
                            except:
                                pass

                        # use jpeginfo against all jpg images as its pretty accurate
                        if ext in ('.jpg','.jpeg'):
                            #rc = 0
                            print('.',end='',flush=True)
                            cmd = 'jpeginfo --check "' + filepath + '"'
                            exitcode, out, err = get_exitcode_stdout_stderr(cmd)
                            #rc = call(["jpeginfo", "--check", filepath], stdout=DEVNULL, stderr=DEVNULL, close_fds=True)
                            if exitcode == 1:
                                corrupt_str = 'JpegInfo' if corrupt_str == None else corrupt_str
                            #del rc

                        if corrupt_str is None:
                            try:
                                with Image.open(filepath) as im:
                                    print('.',end='',flush=True)
                                    im.verify()
                            except:
                                e = sys.exc_info()[0]
                                corrupt_str = 'PIL_Verify' if corrupt_str == None else corrupt_str
                            else:
                                try:
                                    with Image.open(filepath) as im:
                                        print('.',end='',flush=True)
                                        temp = im.copy()
                                        im.load()
                                except:
                                    e =  sys.exc_info()[0]
                                    corrupt_str = 'PIL_Load' if corrupt_str == None else corrupt_str
                                finally:
                                    try:
                                        temp.close()
                                    except:
                                        pass
                                    try:
                                        im.close()
                                    except:
                                        pass
                            finally:
                                try:
                                    im.close()
                                except:
                                    pass
                                try:
                                    temp.close()
                                except:
                                    pass

                    # raw image processing
                    if ext in (rawimageext):
                        print('.',end='',flush=True)
                        # try to load raw using libraw via rawpy first, 
                        # generally if libraw can't load it then ufraw extraction would also fail
                        if corrupt_str == None:
                            try:
                                with rawpy.imread(filepath) as raw:
                                    rgb = raw.postprocess(use_camera_wb=True)
                                    temp_raw = NamedTemporaryFile(suffix='.jpg')
                                    Image.fromarray(rgb).save(temp_raw.name)
                                    with Image.open(temp_raw.name) as im:
                                        hash_image = []
                                        for x in range(3):
                                            print('.',end='',flush=True)
                                            hash_i = str(imagehash.dhash(im.rotate(90 * x, expand=1),32))
                                            if ''.join(set(hash_i)) != '0':
                                                hash_image.append(hash_i)
                                        if hash_image:
                                            hash_image.append(3)
                                            hashes_tmp.append(hash_image)

                            except(rawpy.LibRawFatalError):
                                e = sys.exc_info()[1]
                                corrupt_str = 'Libraw_FE'
                            except(rawpy.LibRawNonFatalError):
                                e = sys.exc_info()[1]
                                corrupt_str = 'Libraw_NFE'
                            except:
                                #print(sys.exc_info())
                                corrupt_str = 'Libraw'

                            finally:
                                try:
                                    im.close()
                                except:
                                    pass
                                try:
                                    temp_raw.close()
                                except:
                                    pass
                                try:
                                    raw.close()
                                except:
                                    pass
                            if corrupt_str == None:
                                # as a final last ditch effort compare perceptual hashes of extracted 
                                # raw and embedded preview to detect possible internal corruption 

                                # extract and convert raw to jpeg image using ufraw
                                temp_raw = NamedTemporaryFile(suffix='.jpg')
                                #rc = 0
                                cmd = 'ufraw-batch --wb=camera --rotate=camera --out-type=jpg --compression=95 --noexif --lensfun=none --auto-crop --output=' + temp_raw.name + ' --overwrite "' + filepath + '"'
                                print('.',end='',flush=True)
                                exitcode, out, err = get_exitcode_stdout_stderr(cmd)
                                if exitcode == 1 or ufrawRegex.search(str(err)) is not None:
                                    corrupt_str = 'Ufraw' if corrupt_str is None else corrupt_str

                                tmpfilesize = os.stat(temp_raw.name).st_size
                                if tmpfilesize > 0:
                                    try:
                                        with Image.open(temp_raw.name) as im:
                                            hash_image = []
                                            for x in range(3):
                                                print('.',end='',flush=True)
                                                hash_i = str(imagehash.dhash(im.rotate(90 * x, expand=1),32))
                                                if ''.join(set(hash_i)) != '0':
                                                    hash_image.append(hash_i)
                                            if hash_image:
                                                hash_image.append(4)
                                                hashes_tmp.append(hash_image)
                                    except:
                                        pass
                                    finally:
                                        try:
                                            im.close()
                                        except:
                                            pass
                                try:
                                    temp_raw.close()
                                except:
                                    pass


                        # attempt to extract preview images
                        imfile = filepath
                        try:
                            with pyexiv2.ImageMetadata(imfile) as metadata_orig:
                                metadata_orig.read()
                                #for i,p in enumerate(metadata_orig.previews):
                                if metadata_orig.previews:
                                    preview = metadata_orig.previews[-1]
                                    # save preview to temp file
                                    temp_preview = NamedTemporaryFile()
                                    preview.write_to_file(temp_preview.name)
                                    os.rename(temp_preview.name + preview.extension, temp_preview.name)

                                    try:
                                        with Image.open(temp_preview.name) as im:
                                            hash_image = []
                                            for x in range(3):
                                                print('.',end='',flush=True)
                                                hash_i = str(imagehash.dhash(im.rotate(90 * x, expand=1),32))
                                                if ''.join(set(hash_i)) != '0':
                                                    hash_image.append(hash_i)
                                            if hash_image:
                                                hash_image.append(5)
                                                hashes_tmp.append(hash_image)
                                    except:
                                        pass
                                    finally:
                                        try:
                                            temp_preview.close()
                                        except:
                                            pass
                                        try:
                                            im.close()
                                        except:
                                            pass
                        except:
                            pass
                        finally:
                            try:
                                metadata_orig.close()
                            except:
                                pass

                    # compare hashes for all images that were found or extracted and find most dissimilar hamming distance (worst)
                    if len(hashes_tmp) > 1:
                        #print('checking_hashes')
                        print('.',end='',flush=True)
                        scores = []

                        for h_idx, hash in enumerate(hashes_tmp):
                            i = h_idx + 1
                            while i < len(hashes_tmp):
                                # normalized distance: fraction of differing hex characters (not bits); 1 is the maximum
                                ham_dist = 1
                                for h1 in hash[:-1]:
                                    for h2 in hashes_tmp[i][:-1]:
                                        ham_dist = min(ham_dist, (sum(bool(ord(ch1) - ord(ch2)) for ch1, ch2 in zip(h1, h2)))/len(h1))
                                if (hash[-1] == 5 and hashes_tmp[i][-1] != 5) or (hash[-1] != 5 and hashes_tmp[i][-1] == 5):
                                    scores.append([ham_dist,hash[-1],hashes_tmp[i][-1]])
                                i = i + 1
                        if scores:
                            worst = sorted(scores, key = lambda x: x[0])[-1]

                            if worst[0] > 0.3:
                                worst1 = str(worst[1])
                                worst2 = str(worst[2])
                                corrupt_str = 'hash' + str(round(worst[0]*100,2)) + '_' + worst1 + '-' + worst2 if corrupt_str is None else corrupt_str

                    # prefix file if corruption was detected ensuring that existing files already prefixed are re prefixed
                    mo = corruptRegex.search(fname)
                    newfilename = None
                    if corrupt_str is not None:
                        print('Corrupt: ' + corrupt_str)
                        if mo is not None:
                            newfilename = re.sub(corruptRegex, '_[' + corrupt_str + ']', fname) + ext
                        else:
                            newfilename = os.path.splitext(fname)[0] + '_[' + corrupt_str + ']' + ext
                    else:
                        print('OK!')
                        if mo is not None:
                            newfilename = re.sub(corruptRegex, '', fname) + ext

                    # remove group index from name if present, this will be assigned in the next step if needed
                    newfilename = newfilename if newfilename is not None else fname
                    mo = groupRegex.search(newfilename)
                    if mo is not None:
                        newfilename = re.sub(groupRegex, '', newfilename)

                    if hashes_tmp:
                        # set() de-duplicates the flattened list of hashes
                        hashes.append(set([item for sublist in hashes_tmp for item in sublist[:-1]]))

                    filelist.append([root,fname,newfilename, len(hashes_tmp)])


            print('******** Grouping similar images... ************')
            if len(hashes) > 1:
                scores = []
                for h_idx, hash in enumerate(hashes):
                    i = h_idx + 1
                    while i < len(hashes):
                        ham_dist = 1
                        for h1 in hash:
                            for h2 in hashes[i]:
                                ham_dist = min(ham_dist, (sum(bool(ord(ch1) - ord(ch2)) for ch1, ch2 in zip(h1, h2)))/len(h1))
                        scores.append(ham_dist)
                        i = i + 1
                X = np.array(scores)

                linkage = fastcluster.single(X)
                w_hash_idx = [el_idx for el_idx, el in enumerate(filelist) if el[3] > 0]
                w_hash = [filelist[i] for i in w_hash_idx]

                test=ct(linkage,[el[2] for el in w_hash],.2)
                for i, prfx in enumerate(test):
                    curfilename = w_hash[i][2]

                    mo = groupRegex.search(curfilename)
                    newfilename = None

                    if prfx is not None:
                        if mo is not None:
                            newfilename = re.sub(groupRegex, '[' + prfx + ']_', curfilename)
                        else:
                            newfilename = '[' + prfx + ']_' + curfilename
                    else:
                        if mo is not None:
                            newfilename = re.sub(groupRegex, '', curfilename)

                #    if newfilename is not None:
                    filelist[w_hash_idx[i]][2] = newfilename if newfilename is not None else curfilename

                #fig = plt.figure(figsize=(25,25))
                #plt.title(root)
                #plt.xlabel('perceptual hash hamming distance')

                #plt.axvline(x=.15,c='red',linestyle='--')
                #dg = dendrogram(linkage, labels=[el[2] for el in w_hash], orientation='right', show_leaf_counts=True)
                #ax = fig.gca()
                #ax.set_xlim(-.02,ax.get_xlim()[1])
                #plt.show()
                #plt.savefig(os.path.join(root,'dendrogram.pdf'), bbox_inches='tight', dpi=100)
                w_hash.clear()
                w_hash_idx.clear()
            print('******** Renaming files if applicable... ************')
            for fr in filelist:
                if fr[1] != fr[2]:
                    #print(fr[1] + ' -- ' + fr[2])
                    path = fr[0]
                    os.rename(os.path.join(path,fr[1]),os.path.join(path,fr[2]))


            filelist.clear()

    duration = datetime.now() - start_time
    days    = divmod(duration.total_seconds(), 86400)        # Get days (without [0]!)
    hours   = divmod(days[1], 3600)               # Use remainder of days to calc hours
    minutes = divmod(hours[1], 60)                # Use remainder of hours to calc minutes
    seconds = divmod(minutes[1], 1)               # Use remainder of minutes to calc seconds
    print("Time to complete: %d days, %02d:%02d:%02d" % (days[0], hours[0], minutes[0], seconds[0]))
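
The pairwise step that the question asks about can also be written separately from the script. The following is a minimal sketch, not the author's method. It assumes each image is represented by a collection of equal-length hex hash strings (as produced by `str(imagehash.dhash(...))`), compares hashes bit by bit rather than hex character by hex character as the script does, keeps the smallest distance for each image pair, and returns the distances in the condensed order that `scipy.spatial.distance.pdist` uses. The function names `hex_hamming`, `min_distance`, `condensed_distances` and `group_by_threshold` are hypothetical, and `group_by_threshold` is a standard-library stand-in for a single-linkage cut, not a SciPy call.

```python
from itertools import combinations


def hex_hamming(h1, h2):
    """Fraction of differing bits between two equal-length hex hash strings."""
    if len(h1) != len(h2):
        raise ValueError("hashes must have the same length")
    differing_bits = bin(int(h1, 16) ^ int(h2, 16)).count("1")
    return differing_bits / (4 * len(h1))  # each hex character encodes 4 bits


def min_distance(hashes_a, hashes_b):
    """Smallest distance over every pairing of two images' hashes (rotations)."""
    return min(hex_hamming(a, b) for a in hashes_a for b in hashes_b)


def condensed_distances(hash_sets):
    """Pairwise distances in condensed order: 0-1, 0-2, ..., 1-2, ..."""
    return [min_distance(a, b) for a, b in combinations(hash_sets, 2)]


def group_by_threshold(hash_sets, threshold):
    """Union-find grouping (assumed stand-in for a single-linkage cut):
    images linked by a chain of distances <= threshold share a label."""
    parent = list(range(len(hash_sets)))

    def find(i):
        # follow parent links to the root, shortening the path on the way
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (i, a), (j, b) in combinations(enumerate(hash_sets), 2):
        if min_distance(a, b) <= threshold:
            parent[find(i)] = find(j)

    # relabel roots as 0, 1, 2, ... in order of first appearance
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(hash_sets))]


# made-up 16-bit hashes: image 0 has two rotations, images 1 and 2 have one each
example = [["ff00", "0ff0"], ["ff01"], ["00ff"]]
print(condensed_distances(example))      # [0.0625, 0.5, 0.9375]
print(group_by_threshold(example, 0.2))  # [0, 0, 1]
```

If SciPy is available, the condensed list should be usable directly as input to `scipy.cluster.hierarchy.linkage(condensed, method='single')`, and `scipy.cluster.hierarchy.fcluster(Z, t=0.2, criterion='distance')` would then give flat group labels comparable to those from `group_by_threshold`.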
