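My directory has hundreds of files, some of which have different names but identical content. I have collected the files into a list and do the following: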

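import os
import itertools
import hashlib

folder = input()
directory = os.listdir(folder)

def check(data):
    # Return the MD5 hex digest of the file's contents
    data_check = hashlib.md5()
    # os.listdir returns bare names, so join them with the folder;
    # read in binary mode because hashlib needs bytes
    with open(os.path.join(folder, data), "rb") as f:
        data_check.update(f.read())
    return data_check.hexdigest()

def match_check(c1, c2):
    return check(c1) == check(c2)

# Compare every pair of files
for collection1, collection2 in itertools.combinations(directory, 2):
    match_check(collection1, collection2)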

1 Answer

You can use a dict instead, with the MD5 sum as the key. For example:

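import hashlib

files = {}

# One pass over the files from the question, instead of comparing pairs
for data in directory:
  with open(data, "rb") as f:
    # Use the hex digest string as the key; the hash object itself
    # would never compare equal to another one
    digest = hashlib.md5(f.read()).hexdigest()
  if digest in files:
    # A file already exists for this MD5 sum, append the file
    files[digest].append(data)
  else:
    # First file with this MD5 sum
    files[digest] = [data]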

Then you can list the values of the dict that share the same key. For example:

for digest, paths in files.items():
  if len(paths) > 1:
    # More than one file with the same MD5 sum
    print(digest, paths)  # do something with the duplicates
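
Putting the two steps together, here is a minimal self-contained sketch of the same dict-based idea. The names `file_md5` and `find_duplicates` and the 64 KiB chunk size are my own choices for illustration, not anything from the question. It reads each file in chunks so large files do not have to fit in memory, and it assumes you only care about regular files directly inside the folder:

```python
import hashlib
import os
from collections import defaultdict


def file_md5(path, chunk_size=65536):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b""
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(folder):
    """Map each MD5 digest to the file names sharing it (duplicates only)."""
    groups = defaultdict(list)
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if os.path.isfile(path):  # skip subdirectories
            groups[file_md5(path)].append(name)
    # Keep only digests that were seen for more than one file
    return {d: names for d, names in groups.items() if len(names) > 1}
```

Calling `find_duplicates(".")` would return a dict whose values are lists of file names with identical content. Each file is hashed once, so the work grows linearly with the number of files rather than with the number of pairs.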
answered 2018-03-29T19:26:11.293