python - 使用多处理和 ssdeep Python 对类似文件进行分组时遇到问题

Question

我正在尝试使用 deephashes (SSDEEP) http://code.google.com/p/pyssdeep处理超过 130 万个文件

它的作用是，。它生成哈希（在 3-6 分钟内生成 130 万），然后相互比较以获得相似性结果。比较非常快，但仅运行单个进程不会完成。所以我们放入 Python 多处理模块来完成任务。

结果是在 30 分钟内完成了 130 万个文本文件。使用 18 个内核（四核至强处理器，共 24 个 CPU）

以下是每个过程的工作原理：

生成 SSDEEP 和。
将这些总和列表拆分为 5000 组块。
在 18 个过程中比较每个块 1 与 5000：每次迭代比较 18 个总和。
根据相似度得分对结果进行分组（默认值为 75）
删除了已经检查过下一次迭代的文件。
从下一个文件开始，下一个组的分数 < 75%
重复直到所有组都完成。
如果有未包含的文件（与任何文件不相似），则将它们添加到剩余列表中。

当所有处理完成后，剩余的文件被合并并递归地相互比较，直到没有结果。

问题是，当文件列表被分块成更小的 (5000) 文件时。有些文件包含在前 5000 个块中，但未包含在另一个组中，使组不完整。

如果我在没有分块的情况下运行，则循环完成需要很长时间。超过 18 小时未完成，。不知道过了多久。

请给我建议。

使用的模块：multiprocessing.Pool，ssdeep python

def ssdpComparer(lst, threshold):
    s = ssdeep()
    check_file = []
    result_data = []
    lst1 = lst
    set_lst = set(lst)

    print '>>>START'
    for tup1 in lst1:
        if tup1 in check_file:
            continue
        for tup2 in set_lst:
            score = s.compare(tup1[0], tup2[0])
            if score >= threshold:
                result_data.append((score, tup1[2], tup2[2])) #Score, GroupID, FileID
                check_file.append(tup2)
        set_lst = set_lst.difference(check_file)
    print """####### DONE #######"""
    remain_lst = set(lst).difference(check_file)

    return (result_data, remain_lst)



def parallelProcessing(tochunk_list, total_processes, threshold, source_path, mode, REMAINING_LEN = 0):
    result = []
    remainining = []
    pooled_lst = []
    pair = []
    chunks_toprocess = []

    print 'Total Files:', len(tochunk_list)

    if mode == MODE_INTENSIVE:
        chunks_toprocess = groupWithBlockID(tochunk_list) #blockID chunks
    elif mode == MODE_THOROUGH:
        chunks_toprocess = groupSafeLimit(tochunk_list, TOTAL_PROCESSES) #Chunks by processes
    elif mode == MODE_FAST:
        chunks_toprocess = groupSafeLimit(tochunk_list) #5000 chunks

    print 'No. of files group to process: %d' % (len(chunks_toprocess))
    pool_obj = Pool(processes = total_processes, initializer = poolInitializer, initargs = [None, threshold, source_path, mode])
    pooled_lst = pool_obj.map(matchingProcess, chunks_toprocess) #chunks_toprocess
    tmp_rs, tmp_rm = getResultAndRemainingLists(pooled_lst)
    result += tmp_rs
    remainining += tmp_rm

    print 'RESULT LEN: %s, REMAINING LEN: %s, P.R.L: %s' % (len(result), len(remainining), REMAINING_LEN)
    tmp_r_len = len(remainining)

    if tmp_r_len != REMAINING_LEN and len(result) > 0 :
        result += parallelProcessing(remainining, total_processes, threshold, source_path, mode, tmp_r_len)
    else:
        result += [('','', rf[2]) for rf in remainining]

    return result

def getResultAndRemainingLists(pooled_lst):
    g_result = []
    g_remaining = []

    for tup_result in pooled_lst:
        tmp_result, tmp_remaining = tup_result
        g_result += tmp_result
        if tmp_remaining:
            g_remaining += tmp_remaining

    return (g_result, g_remaining)

score 0 · Accepted Answer

第一个建议：在您的情况下，不需要将check_file作为 list => 将其更改为set() - 那么它应该会更好（最后有解释）。

如果你需要有块，也许这样的过程就足够了：

def split_to_chunks(wholeFileList):
    s = ssdeep()
    calculated_chunks = []
    for someFileId in wholeFileList:
        for chunk in calculated_chunks:
            if s.compare(chunk[0], someFileId) > threshold:
                chunk.append(someFileId)
                break
        else: # important: this else is on 'for ' level
            # so if there was no 'break' so someFileId is a base for new chunk:
            calculated_chunks.append( [someFileId] )
    return calculated_chunks

之后您可以过滤结果： groups = filter(lambda x: len(x) > 1, result) 仍然 = filter(lambda x: len(x) == 1, result)

注意：这个算法假设块的第一个元素是一种“基础”。结果的好坏很大程度上取决于 ssdeep 的行为（我可以想象一个奇怪的问题：多少 ssdeep 是可传递的？）如果这种相似性那么它应该是......

最坏的情况是，如果任何对 s.compare(fileId1, fileId2) 的分数不满足阈值条件（那么复杂度为 n^2，因此在您的情况下为 130 万 * 130 万）。

没有简单的方法来优化这种情况。让我们想象一下情况，其中 s.compare(file1, file2) 总是接近 0 然后（据我所知）即使你知道 s.compare(A, B) 非常低而 s.compare(B, C) 非常低那么你仍然不能说任何关于 s.compare(A, C) => 所以你需要有 n*n 操作。

另一个注意事项：假设您使用了太多的结构和列表，例如：

set_lst = set_lst.difference(check_file)

该指令创建新的 set() 并且 set_lst 和 check_file 中的所有元素必须至少被触摸一次，因为 check_file 是一个列表，因此无法优化“差异”函数并且它变得复杂：len(check_file) * log( len(set_lst))

基本上：如果这些结构在增长（130 万个，差不多），那么你的计算机需要执行更多的计算。如果您使用 check_file = set() 而不是 [] (list) 那么它的复杂性应该是： len(set_lst) + len(check_file)

与检查元素是否在 python 的列表（数组）中相同：

if tup1 in check_file:

因为check_file是列表 -> 如果 tup1 不在列表中，您的 cpu 需要将 tup1 与所有元素进行比较，因此其复杂性为 len(check_file) 如果您将 check_file 更改为 set 则其复杂性约为 log2(len (check_file)) 让差异更直观，假设 len(*check_file*) = 1mln，你需要多少比较？

设置：log2(1m) = log2(1000000) ~ 20

列表：len(check_file) = 100 万

python - 使用多处理和 ssdeep Python 对类似文件进行分组时遇到问题

1 回答 1

Related

Reference