python - Comparing two files in Python when each file contains duplicate lines

Question

I am trying to compare two files in Python. The goal is to find the new warnings by comparing a new warnings file against an old one.

The old file looks like this:

warning1~file1
warning1~file1
warning2~file2
warning2~file2
warning2~file2

The new file looks like this:

warning1~file1
warning1~file1
warning1~file1
warning3~file3
warning2~file2
warning2~file2
warning2~file2

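As you can see, the new file has 2 new lines: one extra warning1~file1 and one warning3~file3. I searched the internet for ways to compare two files, but the approaches I found only check whether a line's text is present and do not account for how many times it is repeated.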

small_file = open('file1', 'r')
long_file = open('file2', 'r')
output_file = open('newfile', 'w')

try:
    small_lines = small_file.readlines()
    small_lines_cleaned = [line.rstrip().lower() for line in small_lines]
    long_lines = long_file.readlines()
    long_lines_cleaned = [line.rstrip().lower() for line in long_lines]

    for line in long_lines_cleaned:
        # Membership test: only asks whether the line occurs anywhere in file1
        if line not in small_lines_cleaned:
            output_file.write(line + '\n')
finally:
    # Close all three files even if reading or writing fails
    small_file.close()
    long_file.close()
    output_file.close()

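I tried this code, which I found here, but after running it I realized that it only checks whether each line of file2 appears anywhere in file1 and, if not, writes it to the new file. This approach only picks up warning3, not the new warning1.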
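To make the failure concrete, here is a small sketch that uses the sample data above as in-memory lists (the names old_lines, new_lines and reported are illustrative and not part of the original code). A membership test ignores how many times a line occurs, so the extra copy of warning1~file1 is never reported:

```python
# Sample data from the question, held in lists instead of files.
old_lines = ["warning1~file1"] * 2 + ["warning2~file2"] * 3
new_lines = ["warning1~file1"] * 3 + ["warning3~file3"] + ["warning2~file2"] * 3

# "not in" only asks whether the line occurs at least once in old_lines,
# so all three copies of warning1~file1 are treated as already known.
reported = [line for line in new_lines if line not in old_lines]
print(reported)  # ['warning3~file3']
```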

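I need something that matches each line only once, so that a line in the old file cancels out exactly one identical line in the new file, and whatever is left over is written to the new file.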

I hope I have explained the problem clearly.

score 2 · Accepted Answer

I would use a Counter to find the difference in the number of occurrences, for example:

from collections import Counter

with open('file1', 'r') as f1, open('file2', 'r') as f2, open('newfile', 'w') as output:
    # Normalize every line the same way in both files
    f1_lines = [line.rstrip().lower() for line in f1]
    f2_lines = [line.rstrip().lower() for line in f2]
    # Counter subtraction keeps only positive counts,
    # i.e. lines that occur more often in file2 than in file1
    diff = Counter(f2_lines) - Counter(f1_lines)
    for msg, n in diff.items():  # use iteritems() on Python 2
        output.write((msg + '\n') * n)
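The Counter difference groups the output by message rather than by position in the new file. If the leftover lines should come out in the order they appear in the new file, one possible variant is to let each old line cancel exactly one matching new line while walking the new file once. This is only a sketch: the function name new_occurrences and the in-memory lists are assumptions for illustration, not part of the answer above.

```python
from collections import Counter


def new_occurrences(old_lines, new_lines):
    """Return the lines of new_lines that are not matched one-for-one
    by a line in old_lines, in the order they appear in new_lines."""
    remaining = Counter(old_lines)  # unmatched copies left per old line
    result = []
    for line in new_lines:
        if remaining[line] > 0:
            remaining[line] -= 1    # cancel against one old occurrence
        else:
            result.append(line)     # no old copy left: this one is new
    return result


old = ["warning1~file1"] * 2 + ["warning2~file2"] * 3
new = ["warning1~file1"] * 3 + ["warning3~file3"] + ["warning2~file2"] * 3
print(new_occurrences(old, new))  # ['warning1~file1', 'warning3~file3']
```

With the sample data from the question, the third warning1~file1 and the single warning3~file3 are the only lines left after matching.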
