python - Diffing binary files in Python

Question

I have two binary files. They look something like this, but the data is more random:

File A:

FF FF FF FF 00 00 00 00 FF FF 44 43 42 41 FF FF ...

File B:

41 42 43 44 00 00 00 00 44 43 42 41 40 39 38 37 ...

What I would like is to call something like:

>>> someDiffLib.diff(file_a_data, file_b_data)

and get back something like:

[Match(pos=4, length=4)]

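This indicates that, in both files, the 4 bytes starting at position 4 are identical. The sequence 44 43 42 41 would not count as a match, because it sits at a different position in each file.

Is there a library that can do this diff for me? Or should I just write the loops to do the comparison myself?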

score 11 · Accepted Answer

You can use itertools.groupby() for this. Here is an example:

from itertools import groupby

# this just sets up some byte strings to use, Python 2.x version is below
# instead of this you would use f1 = open('some_file', 'rb').read()
f1 = bytes(int(b, 16) for b in 'FF FF FF FF 00 00 00 00 FF FF 44 43 42 41 FF FF'.split())
f2 = bytes(int(b, 16) for b in '41 42 43 44 00 00 00 00 44 43 42 41 40 39 38 37'.split())

matches = []
for k, g in groupby(range(min(len(f1), len(f2))), key=lambda i: f1[i] == f2[i]):
    if k:
        pos = next(g)
        length = len(list(g)) + 1
        matches.append((pos, length))

Or the same thing as above using a list comprehension:

matches = [(next(g), len(list(g))+1)
           for k, g in groupby(range(min(len(f1), len(f2))), key=lambda i: f1[i] == f2[i])
               if k]

If you are using Python 2.x, here is the setup for the example:

f1 = ''.join(chr(int(b, 16)) for b in 'FF FF FF FF 00 00 00 00 FF FF 44 43 42 41 FF FF'.split())
f2 = ''.join(chr(int(b, 16)) for b in '41 42 43 44 00 00 00 00 44 43 42 41 40 39 38 37'.split())
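
To get the exact shape of result the question asks for, the groupby approach can be wrapped in a small function. This is only a sketch for Python 3: the names Match and diff are borrowed from the question's wished-for output and are not part of any existing library, and both inputs are assumed to be bytes objects already read into memory.

```python
from collections import namedtuple
from itertools import groupby

# Hypothetical result type, named after the output shown in the question.
Match = namedtuple("Match", ["pos", "length"])


def diff(a, b):
    """Return runs of positions at which bytes objects a and b are equal."""
    n = min(len(a), len(b))  # only the overlapping prefix can be compared
    matches = []
    # groupby splits 0..n-1 into consecutive runs of "equal" / "not equal".
    for equal, run in groupby(range(n), key=lambda i: a[i] == b[i]):
        if equal:
            indices = list(run)
            matches.append(Match(pos=indices[0], length=len(indices)))
    return matches


file_a_data = bytes.fromhex("FF FF FF FF 00 00 00 00 FF FF 44 43 42 41 FF FF")
file_b_data = bytes.fromhex("41 42 43 44 00 00 00 00 44 43 42 41 40 39 38 37")
print(diff(file_a_data, file_b_data))  # [Match(pos=4, length=4)]
```

Only the overlapping prefix of the two inputs is compared, so trailing bytes in the longer file are ignored, just as in the loop above.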

score 3 · Accepted Answer

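The itertools.groupby solution provided works fine, but it is slow.

I wrote a very naive attempt using numpy and tested it against the other solution on one particular 16MB file I happened to have; on my machine it was about 42 times faster. Someone familiar with numpy could probably improve on this significantly.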

import numpy as np

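def compare(path1, path2):
    # Read both files as raw bytes and truncate to the shorter length.
    x, y = np.fromfile(path1, np.uint8), np.fromfile(path2, np.uint8)
    length = min(x.size, y.size)
    x, y = x[:length], y[:length]

    # Indices at which the two files hold the same byte.
    z = np.where(x == y)[0]
    if z.size == 0:
        # No matches: return an empty (0, 2) array so the shape is consistent.
        return np.empty((0, 2), dtype=z.dtype)

    # A new run starts wherever consecutive matching indices are not adjacent.
    borders = np.append(np.insert(np.where(np.diff(z) != 1)[0] + 1, 0, 0), len(z))
    lengths = borders[1:] - borders[:-1]
    starts = z[borders[:-1]]
    # One row per run: [start position, run length].
    return np.array([starts, lengths]).T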
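
For trying the same idea on data that is already in memory, such as the sample bytes from the question, here is a variant that takes bytes objects instead of file paths. It is a sketch only: compare_bytes is a made-up name, and np.frombuffer stands in for np.fromfile so that no files are needed.

```python
import numpy as np


def compare_bytes(a, b):
    """Return an (n, 2) array of [start, length] rows for equal-byte runs."""
    length = min(len(a), len(b))
    # View the bytes objects as uint8 arrays without copying them.
    x = np.frombuffer(a, dtype=np.uint8)[:length]
    y = np.frombuffer(b, dtype=np.uint8)[:length]

    z = np.flatnonzero(x == y)  # positions where both inputs agree
    if z.size == 0:
        return np.empty((0, 2), dtype=np.intp)

    # Positions in z where a new run begins (gap between neighbouring indices).
    breaks = np.flatnonzero(np.diff(z) != 1) + 1
    borders = np.concatenate(([0], breaks, [z.size]))
    starts = z[borders[:-1]]
    lengths = np.diff(borders)
    return np.column_stack((starts, lengths))


a = bytes.fromhex("FF FF FF FF 00 00 00 00 FF FF 44 43 42 41 FF FF")
b = bytes.fromhex("41 42 43 44 00 00 00 00 44 43 42 41 40 39 38 37")
print(compare_bytes(a, b).tolist())  # [[4, 4]]
```

Each row is a [start, length] pair, so the single row here corresponds to Match(pos=4, length=4) in the question's notation.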
