python - Test whether the lines in file1 are a subset of the lines in file2

Question

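I tried searching online for an answer, but unfortunately without success, so I am asking here:

I am trying to find out whether all lines of file1 are present in file2. Fortunately, I only need to compare whole lines rather than individual words. Unfortunately, I am dealing with files that are gigabytes in size, so some of the basic solutions I tried gave me memory errors.

At the moment I have the following code, which does not work. Some guidance would be greatly appreciated.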

import mmap

# Checks if all lines in file1 are present in file2
def isFile1SubsetOfFile2(file1, file2):
    # Map file2 once instead of reopening and remapping it for every line
    with open(file1, "rb") as f1, open(file2, "rb") as f2:
        with mmap.mmap(f2.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for line1 in f1:
                # find() searches for a substring, so "This is line1" is
                # also found inside "This is line10."
                if mm.find(line1.strip()) == -1:
                    return False
    return True

Example file2:

This is line1.
This is line2.
This is line3.
This is line4.
This is line5.
This is line6.
This is line7.
This is line8.
This is line9.

The check should pass if file1 is, for example:

This is line4.
This is line5.

The check should fail if file1 is, for example:

This is line4.
This is line10.

Edit: I have just added a working version of my code for the benefit of others. It gives no memory errors, but it is slow.

score 0 · Accepted Answer

I'm not sure why it doesn't work, but I think I know how to solve it:

def is_subset_of(file1, file2):
    with open(file1, 'r') as f1, open(file2, 'r') as f2:
        for line in f1:
            line = line.strip()
            f2.seek(0)   # go to the start of f2
            if line not in (line2.strip() for line2 in f2):
                return False
    return True

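This avoids opening the second file multiple times, because it simply seeks back to the start for every line of file1, and at any time you only hold two lines in memory. That should be very memory-friendly.

Another approach (possibly faster) would be to sort both file1 and file2. You can then compare the files line by line, advancing to the next line of file2 whenever its current line is lexically smaller than the current line of file1. Instead of O(n**2), this can be done in O(n*log(n)). However, it is much more complicated, and I don't know whether sorting files of several GB makes sense (it might use too much memory!).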
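The comparison step of the sorted approach can be sketched as follows. This is only an illustration, not code from the answer: the name `is_sorted_subset` and its parameters are made up, and it assumes both files have already been sorted in Python's own string order, compared without the trailing newline.

```python
def is_sorted_subset(sorted_path1, sorted_path2):
    """Return True if every line of sorted_path1 also occurs in sorted_path2.

    Assumption: both files are already sorted in Python's string order,
    compared without the trailing newline.
    """
    with open(sorted_path1, "r") as f1, open(sorted_path2, "r") as f2:
        line2 = f2.readline()
        for line1 in f1:
            key1 = line1.rstrip("\n")
            # Skip lines of file2 that sort before the current line of file1.
            while line2 and line2.rstrip("\n") < key1:
                line2 = f2.readline()
            # file2 is exhausted or has already moved past key1: no match.
            if not line2 or line2.rstrip("\n") != key1:
                return False
    return True
```

Each file is read once from start to finish, so after sorting the comparison itself is linear and holds only one line per file in memory.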

score 0 · Accepted Answer

Working with files that do not fit in memory is always hard.

If file1 fits in memory but file2 is too large, here is a solution:

# file1 and file2 are open file-like objects
unseen = set(file1)
for line in file2:
    unseen.discard(line)  # discard() does not raise if the line is absent
# if unseen is empty, all lines were found in file2
If neither file fits in memory, you should sort at least one of the files (or perhaps use a CFBS sort).
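One way to sort a file that does not fit in memory is an external merge sort: sort fixed-size chunks in memory, write each chunk to a temporary file, then merge the chunks lazily. The sketch below is one possible reading of the suggestion, not something the answer spells out. The name `external_sort` and the `chunk_lines` parameter are hypothetical, and it assumes the number of chunks stays below the operating system's limit on open files.

```python
import heapq
import itertools
import os
import tempfile


def _key(line):
    # Compare lines by their content, without the trailing newline.
    return line.rstrip("\n")


def external_sort(src_path, dst_path, chunk_lines=1_000_000):
    """Write the lines of src_path to dst_path in sorted order.

    Only about chunk_lines lines are held in memory at any time.
    """
    chunk_paths = []
    try:
        with open(src_path, "r") as src:
            while True:
                chunk = [_key(line) for line in itertools.islice(src, chunk_lines)]
                if not chunk:
                    break
                chunk.sort()
                fd, path = tempfile.mkstemp(suffix=".chunk")
                with os.fdopen(fd, "w") as tmp:
                    tmp.writelines(line + "\n" for line in chunk)
                chunk_paths.append(path)
        # Merge the sorted chunks lazily: one line per chunk is in memory.
        chunks = [open(path, "r") for path in chunk_paths]
        try:
            with open(dst_path, "w") as dst:
                dst.writelines(heapq.merge(*chunks, key=_key))
        finally:
            for chunk_file in chunks:
                chunk_file.close()
    finally:
        for path in chunk_paths:
            os.remove(path)
```

A smaller `chunk_lines` lowers peak memory but creates more temporary files, so the value is a trade-off to tune for the machine at hand.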
