python - Compute the hash of only the core image data (excluding metadata) of an image

Question

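I'm writing a script to calculate the MD5 sum of an image, excluding the EXIF tag.

To do this accurately, I need to know where the EXIF tag is located in the file (beginning, middle, or end) so that I can exclude it.

How can I determine where in the file the tag is located?

The images I am scanning are in TIFF, JPG, PNG, BMP, DNG, CR2, and NEF format, plus some videos in MOV, AVI, and MPG format.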

score 22 · Accepted Answer

It is much easier to use the Python Imaging Library to extract the picture data (example in IPython):

In [1]: from PIL import Image

In [2]: import hashlib

In [3]: im = Image.open('foo.jpg')

In [4]: hashlib.md5(im.tobytes()).hexdigest()
Out[4]: '171e2774b2549bbe0e18ed6dcafd04d5'

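This works for any type of image that PIL can handle. The tobytes method returns a bytes object containing the pixel data.

By the way, the MD5 hash is now considered quite weak. It is better to use SHA512: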

In [6]: hashlib.sha512(im.tobytes()).hexdigest()
Out[6]: '6361f4a2722f221b277f81af508c9c1d0385d293a12958e2c56a57edf03da16f4e5b715582feef3db31200db67146a4b52ec3a8c445decfc2759975a98969c34'

On my machine, calculating the MD5 checksum of a 2500x1600 JPEG takes about 0.07 seconds. With SHA512 it takes 0.10 seconds. Complete example:

#!/usr/bin/env python3

from PIL import Image
import hashlib
import sys

im = Image.open(sys.argv[1])
print(hashlib.sha512(im.tobytes()).hexdigest(), end="")

For movies, you can extract frames from them with e.g. ffmpeg, and then process them as shown above.
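One caveat with hashing `tobytes()` alone: the byte string does not record the image's dimensions or mode, so two different images can in principle share it. The following is a minimal sketch, assuming Pillow is installed, that also mixes the mode and size into the hash. The function name `pixel_hash` and the header layout are illustrative choices, not part of the original answer.

```python
import hashlib

from PIL import Image


def pixel_hash(im, algorithm="sha512"):
    """Hash decoded pixels plus mode and size; file metadata is ignored."""
    digest = hashlib.new(algorithm)
    # Mixing in mode and size keeps e.g. a 2x8 and an 8x2 image with the
    # same byte sequence from producing the same hash.
    header = "{}:{}x{}:".format(im.mode, im.size[0], im.size[1])
    digest.update(header.encode("ascii"))
    digest.update(im.tobytes())
    return digest.hexdigest()
```

Usage would look like `pixel_hash(Image.open("foo.jpg"))`. Note that the same picture stored in two different modes (for example palette versus RGB) still hashes differently with this sketch.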

score 8 · Accepted Answer

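One simple approach is to hash the core image data. For PNG, you can do this by hashing only the "critical chunks" (i.e. the chunks whose type starts with an uppercase letter). JPEG has a similar but simpler file structure.

The visual hash in ImageMagick decompresses the image as it hashes it. In your case, you can hash the compressed image data right away, so (if implemented correctly) it should be as fast as hashing the raw file.

Here is a small Python script that illustrates the idea. It may or may not work for you, but it should at least show what I mean :)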

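import hashlib
import os
import struct


def png(fh):
    """Hash only the IDAT (compressed pixel data) chunks of a PNG file."""
    digest = hashlib.md5()
    assert fh.read(8)[1:4] == b"PNG"  # bytes 1-3 of the PNG signature
    while True:
        try:
            # Each chunk starts with a 4-byte big-endian unsigned length.
            length, = struct.unpack(">I", fh.read(4))
        except struct.error:
            break  # short read: end of file
        if fh.read(4) == b"IDAT":
            digest.update(fh.read(length))
            fh.read(4)  # skip the CRC
        else:
            fh.seek(length + 4, os.SEEK_CUR)  # skip chunk data and CRC
    print("Hash: %s" % digest.hexdigest())


def jpeg(fh):
    """Hash everything that follows the start-of-scan marker of a JPEG file."""
    digest = hashlib.md5()
    assert fh.read(2) == b"\xff\xd8"  # SOI marker
    while True:
        # Each segment is a 2-byte marker followed by a 2-byte length.
        marker, length = struct.unpack(">2H", fh.read(4))
        assert marker & 0xff00 == 0xff00
        if marker == 0xFFDA:  # start of scan
            digest.update(fh.read())
            break
        else:
            # The length field counts its own two bytes.
            fh.seek(length - 2, os.SEEK_CUR)
    print("Hash: %s" % digest.hexdigest())


if __name__ == '__main__':
    # Files must be opened in binary mode.
    with open("sample.png", "rb") as fh:
        png(fh)
    with open("sample.jpg", "rb") as fh:
        jpeg(fh)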
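The script above hashes only the IDAT chunks, while the text describes hashing all critical chunks. A minimal sketch of that variant, using only the standard library, might look like the following. The names `iter_png_chunks` and `png_critical_hash` are illustrative, and the choice to hash chunk data without the chunk type (so that the hash does not depend on how the IDAT stream was split) is an assumption of this sketch.

```python
import hashlib
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def iter_png_chunks(fh):
    """Yield (chunk_type, data) for each chunk of a PNG file object."""
    if fh.read(8) != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    while True:
        header = fh.read(8)
        if len(header) < 8:
            return  # end of file
        # 4-byte big-endian length followed by the 4-byte chunk type.
        length, chunk_type = struct.unpack(">I4s", header)
        data = fh.read(length)
        fh.read(4)  # skip the CRC
        yield chunk_type, data


def png_critical_hash(fh, algorithm="md5"):
    """Hash the data of all critical chunks (IHDR, PLTE, IDAT, IEND)."""
    digest = hashlib.new(algorithm)
    for chunk_type, data in iter_png_chunks(fh):
        # Critical chunks have an uppercase first letter, i.e. bit 5 of
        # the first type byte is clear.
        if not chunk_type[0] & 0x20:
            digest.update(data)
    return digest.hexdigest()
```

Because the compressed stream is hashed as stored, the same pixels saved with different compression settings will still produce different hashes; only ancillary chunks such as tEXt or eXIf are ignored.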

score 3 · Accepted Answer

You can use stream which is part of the ImageMagick suite:

$ stream -map rgb -storage-type short image.tif - | sha256sum
d39463df1060efd4b5a755b09231dcbc3060e9b10c5ba5760c7dbcd441ddcd64  -

or

$ sha256sum <(stream -map rgb -storage-type short image.tif -)
d39463df1060efd4b5a755b09231dcbc3060e9b10c5ba5760c7dbcd441ddcd64  /dev/fd/63

This example is for a TIFF file which is RGB with 16 bits per sample (i.e. 48 bits per pixel). So I use map to rgb and a short storage-type (you can use char here if the RGB values are 8-bits).

This method reports the same signature hash that the verbose ImageMagick identify command reports:

$ identify -verbose image.tif | grep signature
signature: d39463df1060efd4b5a755b09231dcbc3060e9b10c5ba5760c7dbcd441ddcd64

(for ImageMagick v6.x; the hash reported by identify on version 7 is different to that obtained using stream, but the latter may be reproduced by any tool capable of extracting the raw bitmap data - such as dcraw for some image types.)

score 1 · Accepted Answer

I would use a metadata stripper to preprocess the files before hashing them.

The ImageMagick package provides one (note that mogrify overwrites the file in place, so run it on a copy):

mogrify -strip blah.jpg

and if you run

identify -list format

you can see that it apparently supports all the formats mentioned in the question.
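For JPEG files, the same strip-then-hash idea can be sketched in pure Python without rewriting the file. The following is an illustrative sketch, not ImageMagick's behaviour: the function name `jpeg_hash_without_metadata` is hypothetical, and treating every APPn segment and every COM segment as metadata is an assumption (some APPn segments, such as Adobe's APP14, can affect how colours are interpreted). It also assumes a well-formed file with no padding bytes between segments.

```python
import hashlib
import struct


def jpeg_hash_without_metadata(fh, algorithm="sha256"):
    """Hash a JPEG stream, skipping APPn (EXIF, XMP, ...) and COM segments."""
    digest = hashlib.new(algorithm)
    if fh.read(2) != b"\xff\xd8":  # SOI marker
        raise ValueError("not a JPEG file")
    while True:
        header = fh.read(4)
        if len(header) < 4:
            raise ValueError("truncated JPEG: no start-of-scan marker")
        marker, length = struct.unpack(">HH", header)
        if marker & 0xFF00 != 0xFF00 or length < 2:
            raise ValueError("unexpected bytes where a segment was expected")
        payload = fh.read(length - 2)  # the length counts its own two bytes
        # Assumption: APP0..APP15 (0xFFE0-0xFFEF) and COM (0xFFFE) are metadata.
        is_metadata = 0xFFE0 <= marker <= 0xFFEF or marker == 0xFFFE
        if not is_metadata:
            digest.update(header)
            digest.update(payload)
        if marker == 0xFFDA:  # start of scan: the rest is image data
            digest.update(fh.read())
            return digest.hexdigest()
```

The file must be opened in binary mode, for example `jpeg_hash_without_metadata(open("blah.jpg", "rb"))`. Unlike hashing only the scan data, this keeps the quantization and Huffman tables and the frame header in the hash, since they are needed to decode the image.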
