希望有人可以提供有关如何计算一堆哈希的成对汉明距离然后对它们进行聚类的指导。我不太关心性能,而是看我在做什么和我想做的事情,无论如何它都会很慢,而且它不会反复运行。
所以......简而言之,我错误地从驱动器上删除了 1000 张照片并且没有备份(我知道......不好的做法)。使用各种工具,我能够从驱动器中恢复非常高的百分比,但留下了数百张 1000 张照片。由于用于恢复某些照片的技术(例如文件雕刻),一些图像在不同程度上损坏了,另一些是相同的副本,还有一些在视觉上基本相同,但逐字节不同。
我正在寻找帮助这种情况的方法如下:
- 检查每个图像并确定图像文件是否在结构上损坏(完成)
- 为每个图像生成感知散列(指纹),以便可以比较图像的相似性和聚类(完成指纹部分)
- 计算指纹的成对距离
- 聚类成对距离,以便可以一起查看相似的图像以帮助手动清理
在随附的脚本中,您会注意到我计算哈希的几个地方,我将解释为不会引起混淆......
- 对于 PIL 支持的图像,我生成三个散列,第一个用于原始图像,第二个旋转 90 度,第三个旋转 180 度。这样做是为了当成对计算完成时,我可以解释方向不同的图像。
- 对于 PIL 不支持的原始图像,我更喜欢从提取的嵌入预览图像生成的哈希值。我这样做而不是使用原始图像,因为在原始图像文件损坏的情况下,预览图像很可能由于其较小的尺寸而完好无损,因此更好地识别图像是否类似于其他
- 生成哈希的其他地方是在最后一次努力识别损坏的原始图像期间。我会将提取/转换的原始图像的哈希值与提取的嵌入预览图像的哈希值进行比较,如果相似度不符合定义的阈值,则假定整个原始文件可能已损坏。
我需要指导的是如何完成以下任务:
- 取每个图像的三个哈希值并计算汉明成对距离
- 对于每个图像比较,只保留最相似的汉明距离
- 将结果输入 scipy 层次聚类,以便我可以对相似的图像进行分组
我只是在学习 Python,所以这是我挑战的一部分......我认为从我从谷歌那里得到的信息中,我可以通过首先使用 scipy.spatial.distance.pdist 获取成对距离来做到这一点,然后处理它以保持每个图像比较的最相似距离,然后将其提供给 scipy 聚类函数。但我不知道如何组织这个并以正确的格式提供东西等。有人可以提供一些指导吗?
这是我当前的脚本以供参考,以防其他人发现我需要更改以存储某种哈希字典或某种磁盘存储很有趣。
from PIL import Image
from PIL import ImageFile
import os, sys, imagehash, pyexiv2, rawpy, re
from tempfile import NamedTemporaryFile
from subprocess import check_call, call
# allow PIL to load truncated images (so that perceptual hashes can be created for truncated/damaged images still)
ImageFile.LOAD_TRUNCATED_IMAGES = True
# image files this script will handle
# PIL supported image formats
stdimageext = ('.jpg','.jpeg', '.bmp', '.png', '.gif', '.tif', '.tiff')
# libraw/ufraw supported formats
rawimageext = ('.nef', '.dng', '.tif', '.tiff')
devnull = open(os.devnull, 'w')
corruptRegex = re.compile(r'_\[.+\]\..{3,4}$')
for root, dirs, files in os.walk(sys.argv[1]):
for filename in files:
ext = os.path.splitext(filename.lower())[1]
filepath = os.path.join(root, filename)
if ext in (stdimageext + rawimageext):
hashes = [None] * 4
print(filename)
# reset corrupt string
corrupt_str = None
if ext in (stdimageext):
metadata = pyexiv2.ImageMetadata(filepath)
metadata.read()
rotate = 0
try:
im = Image.open(filepath)
except:
None
else:
for x in range(3):
hashes[x] = imagehash.dhash(im.rotate(90 * (x + 1)),32)
# use jpeginfo against all jpg images as its pretty accurate
if ext in ('.jpg','.jpeg'):
rc = 0
rc = call(["jpeginfo", "--check", filepath], stdout=devnull, stderr=devnull)
if rc == 1:
corrupt_str = 'JpegInfo'
if corrupt_str is None:
try:
im = Image.open(filepath)
im.verify()
except:
e = sys.exc_info()[0]
corrupt_str = 'PIL_Verify'
else:
try:
im = Image.open(filepath)
im.load()
except:
e = sys.exc_info()[0]
corrupt_str = 'PIL_Load'
# raw image processing
else:
# extract largest embedded preview image first
metadata_orig = pyexiv2.ImageMetadata(filepath)
metadata_orig.read()
if len(metadata_orig.previews) > 0:
preview = metadata_orig.previews[-1]
# save preview to temp file
temp_preview = NamedTemporaryFile()
preview.write_to_file(temp_preview.name)
os.rename(temp_preview.name + preview.extension, temp_preview.name)
rotate = 0
try:
im = Image.open(temp_preview.name)
except:
None
else:
for x in range(4):
hashes[x] = imagehash.dhash(im.rotate(90 * (x + 1)),32)
# close temp file
temp_preview.close()
# try to load raw using libraw via rawpy first,
# generally if libraw can't load it then ufraw extraction would also fail
try:
with rawpy.imread(filepath) as im:
None
except:
e = sys.exc_info()[0]
corrupt_str = 'Libraw_Load'
else:
# as a final last ditch effort compare perceptual hashes of extracted
# raw and embedded preview to detect possible internal corruption
if len(metadata_orig.previews) > 0:
# extract and convert raw to jpeg image using ufraw
temp_raw = NamedTemporaryFile(suffix='.jpg')
try:
check_call(['ufraw-batch', '--wb=camera', '--rotate=camera', '--out-type=jpg', '--compression=95', '--noexif', '--lensfun=none', '--output=' + temp_raw.name, '--overwrite', '--silent', filepath],stdout=devnull, stderr=devnull)
except:
e = sys.exc_info()[0]
corrupt_str = 'Ufraw-conv'
else:
rhash = imagehash.dhash(Image.open(temp_raw.name),32)
# compare preview with raw image and compute the most similar hamming distance (best)
hamdiff = .0
for h in range(4):
# calculate hamming distance to compare similarity
hamdiff = max((256 - sum(bool(ord(ch1) - ord(ch2)) for ch1, ch2 in zip(str(hashes[h]), str(rhash))))/256,hamdiff)
if hamdiff < .7: # raw file is probably corrupt
corrupt_str = 'hash' + str(round(hamdiff*100,2))
# close temp files
temp_raw.close()
print(hamdiff)
print(rhash)
print(hashes[0])
print(hashes[1])
print(hashes[2])
print(hashes[3])
# prefix file if corruption was detected ensuring that existing files already prefixed are re prefixed
mo = corruptRegex.search(filename)
if corrupt_str is not None:
if mo is not None:
os.rename(filepath,os.path.join(root, re.sub(corruptRegex, '_[' + corrupt_str + ']', filename) + ext))
else:
os.rename(filepath,os.path.join(root, os.path.splitext(filename)[0] + '_[' + corrupt_str + ']' + ext))
else:
if mo is not None:
os.rename(filepath,os.path.join(root, re.sub(corruptRegex, '', filename) + ext))
已编辑 只是想提供我最终提出的更新,这似乎非常适合我的预期目的,也许它对处于类似情况的其他用户有用。该脚本仍然可以使用一些抛光,但除此之外所有的肉都在那里。因为我很喜欢使用 Python,如果有人看到可以大大改进的东西,请告诉我。
该脚本执行以下操作:
- 尝试使用各种方法根据文件结构检测图像损坏。对于原始图像格式(NEF、DNG、TIF),有时我发现损坏的图像仍然可以正常加载,因此我决定对预览图像和原始图像的提取 .jpg 进行哈希处理,并比较哈希值是否相似够了,我假设图像以某种形式损坏。
- 为每个可以加载的图像创建感知散列。为基本文件创建了三个(原始文件、原始文件旋转 90、原始文件旋转 180)。此外,对于原始图像,为提取的预览图像创建了额外的 3 个哈希值,这样做是为了在原始图像损坏的情况下,我们仍然可以基于完整图像获得哈希值(假设预览很好)。
- 对于被识别为损坏的图像,它们会使用一个后缀进行重命名,该后缀表示损坏以及决定它的原因。
- 通过将哈希值与所有文件对进行比较来计算成对汉明距离,并将其存储在一个 numpy 数组中。
- 成对距离的平方形式被馈送到 fastcluster 进行聚类
- fastcluster 的输出用于生成树状图,以可视化相似图像的集群
我将 numpy 数组保存到磁盘,以便以后可以重新运行 fastcluster/dendrogram 部分,而无需重新计算每个速度较慢的文件的哈希值。这是我必须更改脚本才能允许的事情......
from PIL import Image
from PIL import ImageFile
import os, sys, imagehash, pyexiv2, rawpy, re
from tempfile import NamedTemporaryFile
from subprocess import check_call, call
import numpy as np
from scipy.cluster.hierarchy import dendrogram
from scipy.spatial.distance import squareform
import fastcluster
import matplotlib.pyplot as plt
# allow PIL to load truncated images (so that perceptual hashes can be created for truncated/damaged images still)
ImageFile.LOAD_TRUNCATED_IMAGES = True
# image files this script will handle
# PIL supported image formats
stdimageext = ('.jpg','.jpeg', '.bmp', '.png', '.gif', '.tif', '.tiff')
# libraw/ufraw supported formats
rawimageext = ('.nef', '.dng', '.tif', '.tiff')
devnull = open(os.devnull, 'w')
corruptRegex = re.compile(r'_\[.+\]\..{3,4}$')
hashes = []
filelist = []
for root, _, files in os.walk(sys.argv[1]):
for filename in files:
ext = os.path.splitext(filename.lower())[1]
relpath = os.path.relpath(root, sys.argv[1])
filepath = os.path.join(root, filename)
if ext in (stdimageext + rawimageext):
hashes_tmp = []
rhash = []
# reset corrupt string
corrupt_str = None
if ext in (stdimageext):
try:
im=Image.open(filepath)
for x in range(3):
hashes_tmp.append(str(imagehash.dhash(im.rotate(90 * x, expand=1),32)))
except:
None
# use jpeginfo against all jpg images as its pretty accurate
if ext in ('.jpg','.jpeg'):
rc = 0
rc = call(["jpeginfo", "--check", filepath], stdout=devnull, stderr=devnull)
if rc == 1:
corrupt_str = 'JpegInfo'
if corrupt_str is None:
try:
im = Image.open(filepath)
im.verify()
except:
e = sys.exc_info()[0]
corrupt_str = 'PIL_Verify'
else:
try:
im = Image.open(filepath)
im.load()
except:
e = sys.exc_info()[0]
corrupt_str = 'PIL_Load'
# raw image processing
if ext in (rawimageext):
# extract largest embedded preview image first
metadata_orig = pyexiv2.ImageMetadata(filepath)
metadata_orig.read()
if len(metadata_orig.previews) > 0:
preview = metadata_orig.previews[-1]
# save preview to temp file
temp_preview = NamedTemporaryFile()
preview.write_to_file(temp_preview.name)
os.rename(temp_preview.name + preview.extension, temp_preview.name)
try:
im = Image.open(temp_preview.name)
for x in range(3):
hashes_tmp.append(str(imagehash.dhash(im.rotate(90 * x,expand=1),32)))
except:
None
# try to load raw using libraw via rawpy first,
# generally if libraw can't load it then ufraw extraction would also fail
try:
im = rawpy.imread(filepath)
except:
e = sys.exc_info()[0]
corrupt_str = 'Libraw_Load'
else:
# as a final last ditch effort compare perceptual hashes of extracted
# raw and embedded preview to detect possible internal corruption
# extract and convert raw to jpeg image using ufraw
temp_raw = NamedTemporaryFile(suffix='.jpg')
try:
check_call(['ufraw-batch', '--wb=camera', '--rotate=camera', '--out-type=jpg', '--compression=95', '--noexif', '--lensfun=none', '--output=' + temp_raw.name, '--overwrite', '--silent', filepath],stdout=devnull, stderr=devnull)
except:
e = sys.exc_info()[0]
corrupt_str = 'Ufraw-conv'
else:
try:
im = Image.open(temp_raw.name)
for x in range(3):
rhash.append(str(imagehash.dhash(im.rotate(90 * x,expand=1),32)))
except:
None
# compare preview with raw image and compute the most similar hamming distance (best)
if len(hashes_tmp) > 0 and len(rhash) > 0:
hamdiff = 1
for rh in rhash:
# calculate hamming distance to compare similarity
hamdiff = min(hamdiff,(sum(bool(ord(ch1) - ord(ch2)) for ch1, ch2 in zip(hashes_tmp[0], rh))/len(hashes_tmp[0])))
if hamdiff > .3: # raw file is probably corrupt
corrupt_str = 'hash' + str(round(hamdiff*100,2))
hashes_tmp = hashes_tmp + rhash
# prefix file if corruption was detected ensuring that existing files already prefixed are re prefixed
mo = corruptRegex.search(filename)
newfilename = None
if corrupt_str is not None:
if mo is not None:
newfilename = re.sub(corruptRegex, '_[' + corrupt_str + ']', filename) + ext
else:
newfilename = os.path.splitext(filename)[0] + '_[' + corrupt_str + ']' + ext
else:
if mo is not None:
newfilename = re.sub(corruptRegex, '', filename) + ext
if newfilename is not None:
os.rename(filepath,os.path.join(root, newfilename))
if len(hashes_tmp) > 0:
hashes.append(hashes_tmp)
if newfilename is not None:
filelist.append(os.path.join(relpath, newfilename))
else:
filelist.append(os.path.join(relpath, filename))
print(len(filelist))
print(len(hashes))
a = np.empty(shape=(len(filelist),len(filelist)))
for hash_idx1, hash in enumerate(hashes):
a[hash_idx1,hash_idx1] = 0
hash_idx2 = hash_idx1 + 1
while hash_idx2 < len(hashes):
ham_dist = 1
for h1 in hash:
for h2 in hashes[hash_idx2]:
ham_dist = min(ham_dist, (sum(bool(ord(ch1) - ord(ch2)) for ch1, ch2 in zip(h1, h2)))/len(h1))
a[hash_idx1,hash_idx2] = ham_dist
a[hash_idx2,hash_idx1] = ham_dist
hash_idx2 = hash_idx2 + 1
print(a)
X = squareform(a)
print(X)
linkage = fastcluster.single(X)
clustdict = {i:[i] for i in range(len(linkage)+1)}
fig = plt.figure(figsize=(25,25))
plt.title('test title')
plt.xlabel('perpetual hash hamming distance')
plt.axvline(x=.15,c='red',linestyle='--')
dg = dendrogram(linkage, labels=filelist, orientation='right', show_leaf_counts=True)
ax = fig.gca()
ax.set_xlim(-.01,ax.get_xlim()[1])
plt.show
plt.savefig('foo1.pdf', bbox_inches='tight', dpi=100)
with open('numpyarray.npy','wb') as f:
np.save(f,a)