python - Hashlib in Windows and Linux

Question

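I'm writing a p2p application in Python and using the hashlib module to identify files in the network that have the same contents but different names.

The problem is that I tested the code that hashes the files on Windows (Vista) with Python 2.7, and it is very fast (less than a second for a couple of gigabytes). On Linux (Fedora 12, with Python 2.6.2 and Python 2.7.1 that I compiled myself because I haven't found an rpm with yum) it is much slower, almost a minute for files under 1 GB.

The question is, why? And can I do something to improve the performance on Linux?

The hashing code is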

import hashlib
import os
...

def crear_lista(directorio):

   lista = open(archivo, "w")

   for (root, dirs, files) in os.walk(directorio):
      for f in files:
         # file to hash
         h = open(os.path.join(root, f), "r")

         # compute the hash of the file
         md5 = hashlib.md5()

         while True:
            trozo = h.read(md5.block_size)
            if not trozo: break
            md5.update(trozo)

         # each line is the file name and its hash
         size = str(os.path.getsize(os.path.join(root, f)) / 1024)
         digest = md5.hexdigest()

         # first line: file name
         # second: size in KB
         # third: hash
         lines = f + "\n" + size + "\n" + digest + "\n"
         lista.write(lines)

         del md5
         h.close()

   lista.close()

I changed r to rb and rU, but the results are the same.

score 3 · Accepted Answer

You are reading the file in 64-byte (hashlib.md5().block_size) blocks and hashing them.
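To get a feel for what that costs, here is a rough sketch that counts the read() calls needed for a hypothetical 1 GiB file (the file size and the reads_needed helper are my own illustration, not from the question):

```python
import hashlib

# Internal block size of the MD5 algorithm, in bytes.
# It describes the algorithm, not a sensible read size.
md5_block = hashlib.md5().block_size

file_size = 1024 ** 3  # hypothetical 1 GiB file, for illustration only


def reads_needed(size, chunk):
    """Number of read() calls needed to consume `size` bytes in `chunk`-byte pieces."""
    return -(-size // chunk)  # ceiling division


small_reads = reads_needed(file_size, md5_block)  # reading md5.block_size bytes at a time
large_reads = reads_needed(file_size, 1048576)    # reading 1 MB at a time

print(md5_block, small_reads, large_reads)
```

Every one of those calls also means a separate md5.update() call, so the per-call overhead in the interpreter is paid millions of times instead of about a thousand.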

You should use a much larger read size, somewhere in the range of 256KB (262144 bytes) to 4MB (4194304 bytes), and then hash that; this digup program reads in 1MB blocks, i.e.:

block_size = 1048576 # 1MB
while True:
    trozo = h.read(block_size)
    if not trozo: break
    md5.update(trozo)
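Put together as a standalone helper, the loop above might look like the sketch below. The function name md5_of_file and the chunk_size parameter are hypothetical, and opening the file in binary mode is my own assumption rather than something the snippet above specifies:

```python
import hashlib


def md5_of_file(path, chunk_size=1048576):
    """Return the hex MD5 digest of the file at `path`.

    The file is read in `chunk_size`-byte pieces (1 MB by default),
    so memory use stays bounded even for multi-gigabyte files.
    """
    md5 = hashlib.md5()
    # Binary mode (an assumption of this sketch) so the bytes are
    # hashed unchanged, with no newline translation on any platform.
    with open(path, "rb") as h:
        while True:
            trozo = h.read(chunk_size)
            if not trozo:
                break
            md5.update(trozo)
    return md5.hexdigest()
```

The digest does not depend on the chunk size, so calling md5_of_file(path) and md5_of_file(path, 64) should give the same result; only the running time changes.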
