python - Compare two .txt files in Python and save exact and similar matches to a .txt file

Question

What I need is:

text_file_1.txt:
apple
orange
ice
icecream

text_file_2.txt:
apple
pear
ice

When I use set, the output is:

apple
ice

("equivalent to re.match")

But I want to get:

apple
ice
icecream

("equivalent to re.search")

Is there any way to do this? The files are large, so I can't just iterate over them and use a regex.

score 2 · Accepted Answer

You might want to take a look at difflib.
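As a rough illustration of what difflib offers here, the sketch below uses `difflib.get_close_matches` on the sample words from the question. The helper name `close_matches` and the `n=3` limit are assumptions made for this example, not part of the answer. Note that with the default cutoff of 0.6, "icecream" and "ice" score about 0.55 and are not reported as close, so the cutoff has to be lowered to get the output the question asks for.

```python
import difflib

words_one = ["apple", "orange", "ice", "icecream"]
words_two = ["apple", "pear", "ice"]


def close_matches(words, candidates, cutoff=0.6):
    """Map each word to the candidates that difflib scores at or above cutoff.

    Hypothetical helper for illustration; n=3 is an arbitrary limit.
    """
    matches = {}
    for word in words:
        found = difflib.get_close_matches(word, candidates, n=3, cutoff=cutoff)
        if found:
            matches[word] = found
    return matches


# Default cutoff: only the exact matches survive.
print(close_matches(words_one, words_two))
# Lower cutoff: "icecream" is now paired with "ice" as well.
print(close_matches(words_one, words_two, cutoff=0.5))
```

The right cutoff depends on the data, so it is worth trying a few values on a sample of the real files.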

Answered 2011-07-07T15:48:21.670

score 1 · Accepted Answer

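If you just want to extract from the files the words where one is a substring of the other (including the ones that are identical), you could do: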

fone = {'apple', 'orange', 'ice', 'icecream'}
ftwo = {'apple', 'pear', 'ice'}
# using sets avoids checking the same combination twice

result = []
for wone in fone:
    for wtwo in ftwo:
        # keep both words when either one is a substring of the other
        if wone in wtwo or wtwo in wone:
            result.append(wone)
            result.append(wtwo)
for w in set(result):
    print(w)
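The snippet above works on hard-coded sets. To connect it to the files in the question, something along these lines could read one word per line and write the matches out. This is a minimal sketch: the function names and the output file name `matches.txt` are assumptions for illustration, and it assumes UTF-8 files with one word per line.

```python
def read_words(path):
    """Return the set of non-empty, stripped lines in a text file."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def substring_matches(words_one, words_two):
    """Collect words that contain, or are contained in, a word from the other set."""
    result = set()
    for wone in words_one:
        for wtwo in words_two:
            if wone in wtwo or wtwo in wone:
                result.update((wone, wtwo))
    return result


def save_matches(path_one, path_two, out_path):
    """Write the matching words to out_path, one per line, in sorted order."""
    matches = substring_matches(read_words(path_one), read_words(path_two))
    with open(out_path, "w", encoding="utf-8") as handle:
        for word in sorted(matches):
            handle.write(word + "\n")
    return matches
```

Calling `save_matches("text_file_1.txt", "text_file_2.txt", "matches.txt")` on the sample files would write apple, ice and icecream. The nested loop still compares every pair, so for very large files the running time grows with the product of the two vocabulary sizes.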

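Alternatively, if you want similarity to be based on how closely the strings resemble each other in their sequence of letters, you can use one of the classes provided by difflib, as Paul suggested in his answer: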

import difflib as dl

fone = {'apple', 'orange', 'ice', 'icecream'}
ftwo = {'apple', 'pear', 'ice'}

result = []
for wone in fone:
    for wtwo in ftwo:
        s = dl.SequenceMatcher(None, wone, wtwo)
        # 0.6 is the conventional threshold to define "close matches";
        # note that 'icecream' vs 'ice' scores about 0.55, just below it
        if s.ratio() > 0.6:
            result.append(wone)
            result.append(wtwo)
for w in set(result):
    print(w)

I haven't timed either of these two samples, but I would guess the second one runs much slower, since for each pair you have to instantiate an object...
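If the per-pair object creation does turn out to be a bottleneck, one possible mitigation is to reuse a single `SequenceMatcher` and to call the cheaper `real_quick_ratio()` and `quick_ratio()` first, which difflib documents as upper bounds on `ratio()`. The sketch below is untimed and the function name `similar_words` is an assumption for illustration.

```python
import difflib


def similar_words(words_one, words_two, threshold=0.6):
    """Return words from either collection whose ratio() exceeds threshold."""
    result = set()
    matcher = difflib.SequenceMatcher()
    for wtwo in words_two:
        # SequenceMatcher caches details about the second sequence,
        # so set it once and vary only the first sequence.
        matcher.set_seq2(wtwo)
        for wone in words_one:
            matcher.set_seq1(wone)
            # The quick variants are upper bounds on ratio(), so a pair
            # that fails them cannot pass the full comparison.
            if (matcher.real_quick_ratio() > threshold
                    and matcher.quick_ratio() > threshold
                    and matcher.ratio() > threshold):
                result.update((wone, wtwo))
    return result


fone = {'apple', 'orange', 'ice', 'icecream'}
ftwo = {'apple', 'pear', 'ice'}
for w in sorted(similar_words(fone, ftwo, threshold=0.5)):
    print(w)
```

The threshold of 0.5 in the usage lines is chosen so that "icecream" and "ice" count as similar; whether this is actually faster on real data would need to be measured, for example with `timeit`.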
