I have two text files. My goal is to find the lines in First.txt that are not in Second.txt and write them to a third text file, Missing.txt. I have that working:

fn = "Missing.txt"

# Stripped lines of Second.txt, kept in a set for fast membership tests
with open('Second.txt', 'r', encoding='utf-8', errors='ignore') as fileSecondary:
    bLines = set(thing.strip() for thing in fileSecondary)

# Opening with 'w' creates the file, or truncates it if it already exists
with open('First.txt', 'r', encoding='utf-8', errors='ignore') as filePrimary, \
        open(fn, 'w', encoding='utf-8') as fileOutPut:
    for line in filePrimary:
        line = line.strip()
        if line not in bLines:
            fileOutPut.write(line + '\n')

But after running the script I ran into a problem: some lines are very similar but not identical. For example:

[PR] Zero One Two Three ft Four

and (No space after the bracket)

[PR]Zero One Two Three ft Four

or

[PR] Zero One Two Three ft Four

and (capital F letter)

[PR] Zero One Two Three Ft Four

I have found SequenceMatcher, which does what I require, but how do I work it into the comparison, given that I am not comparing two strings but a string and a set?

1 Answer

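IIUC, you want lines to match even when their whitespace or capitalization differs.

A simple approach is to strip out the whitespace and lowercase everything as you read it: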

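import re

def format_line(line):
    # Remove all whitespace and lowercase, so spacing and case are ignored
    return re.sub(r"\s+", "", line).lower()

with open('Second.txt', 'r', encoding='utf-8', errors='ignore') as fileSecondary:
    bLines = set(format_line(thing) for thing in fileSecondary)

with open('First.txt', 'r', encoding='utf-8', errors='ignore') as filePrimary, \
        open('Missing.txt', 'w', encoding='utf-8') as fileOutPut:
    for line in filePrimary:
        # Compare the normalized form, but write the original text
        fline = format_line(line)
        if fline not in bLines:
            # The line still carries its newline, so strip it before adding one
            fileOutPut.write(line.rstrip('\n') + '\n')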
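
As a quick sanity check, here is a minimal sketch that applies the same normalization to the sample lines from the question, using in-memory lists instead of files. The names first, second, known and missing, and the "[PR] Some Other Line" entry, are made up for illustration.

```python
import re

def format_line(line):
    # Remove all whitespace and lowercase, so spacing and case are ignored
    return re.sub(r"\s+", "", line).lower()

# Hypothetical stand-ins for the contents of First.txt and Second.txt
first = ["[PR] Zero One Two Three ft Four", "[PR] Some Other Line"]
second = ["[PR]Zero One Two Three ft Four", "[PR] Zero One Two Three Ft Four"]

known = {format_line(s) for s in second}
missing = [line for line in first if format_line(line) not in known]
print(missing)  # ['[PR] Some Other Line']
```

The set lookup stays O(1) per line, which is why this approach is preferable whenever plain normalization is enough.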

Update 1: Fuzzy matching

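If you want fuzzy matching, you can use something like nltk.metrics.distance.edit_distance (docs), but there is no way around comparing each line against every other line (in the worst case). You lose the speed of the in operation.

For example: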

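from nltk.metrics.distance import edit_distance as dist

threshold = 3  # lines fewer than this many edits apart count as a match

# Replaces the loop above; format_line, bLines and the open files are reused
for line in filePrimary:
    fline = format_line(line)
    # Linear scan over every normalized line of Second.txt, stopping at the first hit
    match_found = any(dist(fline, other_line) < threshold for other_line in bLines)

    if not match_found:
        # The line still carries its newline, so strip it before adding one
        fileOutPut.write(line.rstrip('\n') + '\n')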
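
Since the question mentions SequenceMatcher, the same string-versus-set comparison can also be sketched with the standard library's difflib instead of nltk. This is only an illustration: the helper name is_similar and the 0.9 ratio cutoff are assumptions, not values from the question, and the cutoff would need tuning on real data.

```python
from difflib import SequenceMatcher

def is_similar(line, candidates, min_ratio=0.9):
    """Return True if line is close enough to any string in candidates.

    min_ratio is an assumed cutoff (1.0 means identical).
    """
    matcher = SequenceMatcher(autojunk=False)
    # SequenceMatcher caches analysis of the second sequence, so set the
    # line once as seq2 and swap each candidate in as seq1.
    matcher.set_seq2(line)
    for other in candidates:
        matcher.set_seq1(other)
        if matcher.ratio() >= min_ratio:
            return True
    return False

# Hypothetical stand-in for the stripped lines of Second.txt
second = {"[PR]Zero One Two Three ft Four"}

print(is_similar("[PR] Zero One Two Three Ft Four", second))
print(is_similar("Completely different line", second))
```

In the loop above, `if not is_similar(line.strip(), bLines):` would take the place of the edit_distance test. It is still a linear scan over the set for every line, so the speed caveat applies here as well.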
answered 2018-01-17T21:48:17.653