python - Computing the CRC of a file in Python

Question

I want to calculate the CRC of a file and get output like: E45A12AC. Here is my code:

#!/usr/bin/env python 
import os, sys
import zlib

def crc(fileName):
    fd = open(fileName,"rb")
    content = fd.readlines()
    fd.close()
    for eachLine in content:
        zlib.crc32(eachLine)

for eachFile in sys.argv[1:]:
    crc(eachFile)

This calculates the CRC for each line, but its output (e.g. -1767935985) is not what I want.

Hashlib works the way I want, but it computes the md5:

import hashlib
m = hashlib.md5()
for line in open('data.txt', 'rb'):
    m.update(line)
print m.hexdigest()

Is it possible to get something similar using zlib.crc32?

score 33 · Accepted Answer

A more compact and optimized version of the code:

def crc(fileName):
    prev = 0
    for eachLine in open(fileName,"rb"):
        prev = zlib.crc32(eachLine, prev)
    return "%X"%(prev & 0xFFFFFFFF)

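PS2: The old PS was deprecated, and has therefore been deleted, following the suggestion in the comments. Thank you. I don't understand how I missed this, but it is really good.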
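The loop in this answer works because zlib.crc32 accepts a running value as its second argument, so feeding the data piece by piece gives the same result as a single call over the whole input. A minimal sketch of that property, assuming an in-memory byte string in place of a file (the helper name crc32_of_pieces is made up for illustration):

```python
import zlib

def crc32_of_pieces(pieces):
    """Fold zlib.crc32 over an iterable of bytes objects (hypothetical helper)."""
    running = 0  # 0 is the starting value for a fresh CRC-32
    for piece in pieces:
        running = zlib.crc32(piece, running)
    return running & 0xFFFFFFFF  # mask so Python 2 also yields an unsigned value

data = b"first line\nsecond line\nthird line\n"
lines = data.splitlines(True)  # keep the newline bytes, as file iteration does
assert crc32_of_pieces(lines) == zlib.crc32(data) & 0xFFFFFFFF
```

How the input is split (by line, by fixed-size block, or not at all) should not affect the final value, only the speed.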

score 18 · Accepted Answer

A modified version of kobor42's answer, with performance improved by a factor of 2-3 by reading fixed-size chunks instead of "lines":

import zlib

def crc32(fileName):
    with open(fileName, 'rb') as fh:
        hash = 0
        while True:
            s = fh.read(65536)
            if not s:
                break
            hash = zlib.crc32(s, hash)
        return "%08X" % (hash & 0xFFFFFFFF)

It also includes leading zeros in the returned string.
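The leading zeros matter whenever the top hex digit of the checksum happens to be 0. A small sketch of the difference between "%X" and "%08X", using the byte string b"hello world" (whose CRC-32 is 0x0D4A1185) rather than a file:

```python
import zlib

value = zlib.crc32(b"hello world") & 0xFFFFFFFF

short = "%X" % value     # no padding: the leading zero is dropped
padded = "%08X" % value  # zero-padded to 8 hex digits

print(short)   # D4A1185
print(padded)  # 0D4A1185
```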

score 13 · Accepted Answer

A hashlib-compatible interface for CRC-32 support:

import zlib

class crc32(object):
    name = 'crc32'
    digest_size = 4
    block_size = 1

    # b'' rather than '' so that the default also works on Python 3,
    # where zlib.crc32 only accepts bytes-like input
    def __init__(self, arg=b''):
        self.__digest = 0
        self.update(arg)

    def copy(self):
        copy = super(self.__class__, self).__new__(self.__class__)
        copy.__digest = self.__digest
        return copy

    def digest(self):
        return self.__digest

    def hexdigest(self):
        return '{:08x}'.format(self.__digest)

    def update(self, arg):
        self.__digest = zlib.crc32(arg, self.__digest) & 0xffffffff

# Now you can define hashlib.crc32 = crc32
import hashlib
hashlib.crc32 = crc32

# Python > 2.7: hashlib.algorithms += ('crc32',)
# Python > 3.2: hashlib.algorithms_available.add('crc32')
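One difference from the real hashlib objects is that their digest() returns bytes, whereas the class above returns an int. If raw bytes are needed, one possible approach (an assumption on my part, not something the answer proposes) is to pack the value as a 4-byte big-endian integer. The function name crc32_digest_bytes is hypothetical:

```python
import struct
import zlib

def crc32_digest_bytes(data):
    """Return the CRC-32 of data as 4 big-endian bytes (hypothetical helper)."""
    value = zlib.crc32(data) & 0xFFFFFFFF
    return struct.pack(">I", value)  # ">I" = big-endian unsigned 32-bit int

digest = crc32_digest_bytes(b"123456789")
assert len(digest) == 4               # consistent with digest_size = 4
assert digest == b"\xcb\xf4\x39\x26"  # 0xCBF43926, the standard CRC-32 check value
```

Big-endian is chosen here only so that the bytes read in the same order as the hex string; other consumers may expect a different byte order.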

score 9 · Accepted Answer

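To show the lowest 32 bits of any integer as 8 hexadecimal digits, without a sign, you can "mask" the value by bitwise AND-ing it with a mask made of 32 bits all set to 1, and then apply formatting. I.e.: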

>>> x = -1767935985
>>> format(x & 0xFFFFFFFF, '08x')
'969f700f'

Whether the integer you are formatting this way comes from zlib.crc32 or from any other computation is completely irrelevant.
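On Python 2, zlib.crc32 can return a negative number because the 32-bit result is treated as signed; on Python 3 the result is always unsigned, so the mask is harmless there. A brief sketch of why the mask works, using the value from the question (the helper name to_hex32 is made up here):

```python
def to_hex32(x):
    """Format the low 32 bits of any int as 8 hex digits (hypothetical helper)."""
    return format(x & 0xFFFFFFFF, "08x")

x = -1767935985
# For a negative value that fits in 32 bits, masking equals adding 2**32
assert x & 0xFFFFFFFF == x + 2**32

print(to_hex32(x))          # 969f700f
print(to_hex32(x).upper())  # 969F700F
```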

score 5 · Accepted Answer

Python 3.8+ (using the walrus operator):

import zlib

def crc32(filename, chunksize=65536):
    """Compute the CRC-32 checksum of the contents of the given filename"""
    with open(filename, "rb") as f:
        checksum = 0
        while chunk := f.read(chunksize):
            checksum = zlib.crc32(chunk, checksum)
        return checksum

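chunksize is the number of bytes read from the file at a time. Whatever you set it to, you will get the same hash for the same file (setting it too low may make your code slow; setting it too high may use too much memory).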

The result is a 32-bit integer. The CRC-32 checksum of an empty file is 0.
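The claim that the chunk size does not change the result can be checked with a throwaway file. The sketch below is one way to do so and is not part of the answer; it uses iter() with a sentinel instead of the walrus operator so that it also runs on older Python 3 versions, and the names file_crc32 and demo are hypothetical:

```python
import os
import tempfile
import zlib

def file_crc32(path, chunksize=65536):
    """Chunked CRC-32 of a file, written without the walrus operator."""
    checksum = 0
    with open(path, "rb") as f:
        # iter(callable, sentinel) keeps calling f.read until it returns b""
        for chunk in iter(lambda: f.read(chunksize), b""):
            checksum = zlib.crc32(chunk, checksum)
    return checksum & 0xFFFFFFFF

def demo():
    """Write random bytes to a temp file and hash it with several chunk sizes."""
    data = os.urandom(100000)
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        results = [file_crc32(path, n) for n in (1, 4096, 65536, 10**6)]
    finally:
        os.remove(path)
    return results, zlib.crc32(data) & 0xFFFFFFFF

results, expected = demo()
print(all(r == expected for r in results))  # True
```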

score 3 · Accepted Answer

Edited to include Altren's solution below.

A modified and more compact version of CrouZ's answer, with slightly improved performance, using a for loop and file buffering:

def forLoopCrc(fpath):
    """With for loop and buffer."""
    crc = 0
    with open(fpath, 'rb', 65536) as ins:
        for x in range(int((os.stat(fpath).st_size / 65536)) + 1):
            crc = zlib.crc32(ins.read(65536), crc)
    return '%08X' % (crc & 0xFFFFFFFF)

Results, on a 6700k with an HDD:

(Note: re-tested multiple times, and it was consistently faster.)

Warming up the machine...
Finished.

Beginning tests...
File size: 90288KB
Test cycles: 500

With for loop and buffer.
Result 45.24728019630359 

CrouZ solution
Result 45.433838356097894 

kobor42 solution
Result 104.16215688703986 

Altren solution
Result 101.7247863946586

Tested in Python 3.6.4 x64 using the script below:

import os, timeit, zlib, random, binascii

def forLoopCrc(fpath):
    """With for loop and buffer."""
    crc = 0
    with open(fpath, 'rb', 65536) as ins:
        for x in range(int((os.stat(fpath).st_size / 65536)) + 1):
            crc = zlib.crc32(ins.read(65536), crc)
    return '%08X' % (crc & 0xFFFFFFFF)

def crc32(fileName):
    """CrouZ solution"""
    with open(fileName, 'rb') as fh:
        hash = 0
        while True:
            s = fh.read(65536)
            if not s:
                break
            hash = zlib.crc32(s, hash)
        return "%08X" % (hash & 0xFFFFFFFF)

def crc(fileName):
    """kobor42 solution"""
    prev = 0
    for eachLine in open(fileName,"rb"):
        prev = zlib.crc32(eachLine, prev)
    return "%X"%(prev & 0xFFFFFFFF)

def crc32altren(filename):
    """Altren solution"""
    buf = open(filename,'rb').read()
    hash = binascii.crc32(buf) & 0xFFFFFFFF
    return "%08X" % hash

fpath = r'D:\test\test.dat'
tests = {forLoopCrc: 'With for loop and buffer.', 
     crc32: 'CrouZ solution', crc: 'kobor42 solution',
         crc32altren: 'Altren solution'}
count = 500

# CPU, HDD warmup
randomItm = [x for x in tests.keys()]
random.shuffle(randomItm)
print('\nWarming up the machine...')
for c in range(count):
    randomItm[0](fpath)
print('Finished.\n')

# Begin test
print('Beginning tests...\nFile size: %dKB\nTest cycles: %d\n' % (
    os.stat(fpath).st_size/1024, count))
for x in tests:
    print(tests[x])
    start_time = timeit.default_timer()
    for c in range(count):
        x(fpath)
    print('Result', timeit.default_timer() - start_time, '\n')

It is faster because for loops are faster than while loops (sources: here and here).

score 2 · Accepted Answer

Merging the two code snippets above gives the following:

try:
    fd = open(decompressedFile,"rb")
except IOError:
    logging.error("Unable to open the file in readmode:" + decompressedFile)
    return 4
eachLine = fd.readline()
prev = 0
while eachLine:
    prev = zlib.crc32(eachLine, prev)
    eachLine = fd.readline()
fd.close()

score 0 · Accepted Answer

You can use base64 to get output like [ERD45FTR]. zlib.crc32 also provides an option for updating a running value.

import os, sys
import zlib
import base64

def crc(fileName):
  fd = open(fileName,"rb")
  content = fd.readlines()
  fd.close()
  prev = None
  for eachLine in content:
   if not prev:
     prev = zlib.crc32(eachLine)
   else:
     prev = zlib.crc32(eachLine, prev)
  return prev

for eachFile in sys.argv[1:]:
  print base64.b64encode(str(crc(eachFile)))

score 0 · Accepted Answer

Solution:

import os, sys
import zlib

def crc(fileName, excludeLine="", includeLine=""):
    try:
        fd = open(fileName, "rb")
    except IOError:
        print("Unable to open the file in readmode: " + fileName)
        return
    prev = 0  # start at 0 so an empty file still formats correctly
    eachLine = fd.readline()
    while eachLine:
        # Skip excluded lines, but still advance to the next line below
        # (on Python 3, excludeLine must be bytes because the file is binary)
        if not (excludeLine and eachLine.startswith(excludeLine)):
            prev = zlib.crc32(eachLine, prev)
        eachLine = fd.readline()
    fd.close()
    return format(prev & 0xFFFFFFFF, '08x') #returns 8 digits crc

for eachFile in sys.argv[1:]:
    print(crc(eachFile))

I really don't know what (excludeLine="", includeLine="") is for...

score 0 · Accepted Answer

There is a faster and more compact way to compute the CRC, using binascii:

import binascii

def crc32(filename):
    buf = open(filename,'rb').read()
    hash = binascii.crc32(buf) & 0xFFFFFFFF
    return "%08X" % hash
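Note that this version reads the whole file into memory, which may be a problem for very large files. binascii.crc32 takes the same optional running value as zlib.crc32, so a chunked variant is possible. A sketch along those lines, where crc32_chunked is a hypothetical name and io.BytesIO stands in for a file opened in 'rb' mode:

```python
import binascii
import io

def crc32_chunked(stream, chunksize=65536):
    """CRC-32 of a binary stream via binascii.crc32 (hypothetical helper)."""
    value = 0
    while True:
        buf = stream.read(chunksize)
        if not buf:  # an empty read means end of stream
            break
        value = binascii.crc32(buf, value)
    return "%08X" % (value & 0xFFFFFFFF)

# b"123456789" is the conventional CRC-32 test input
print(crc32_chunked(io.BytesIO(b"123456789"), chunksize=4))  # CBF43926
```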
