python - Removing similar strings from multiple files in Python

Question

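I have crawled txt files from different websites, and now I need to paste them into a single file. Many lines from the various sites are similar to each other, and I want to remove the duplicates. Here is what I have tried: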

import difflib

sourcename = 'xiaoshanwujzw'
destname = 'bindresult'
sourcefile = open('%s.txt' % sourcename)
sourcelines = sourcefile.readlines()
sourcefile.close()
for sourceline in sourcelines:

    destfile = open('%s.txt' % destname, 'a+')
    destlines = destfile.readlines()

    similar = False
    for destline in destlines:
        ratio = difflib.SequenceMatcher(None, destline, sourceline).ratio()
        if ratio > 0.8:
            print(destline)
            print(sourceline)
            similar = True

    if not similar:
        destfile.write(sourceline)

    destfile.close()

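I will run it once for each source and write line by line into the same file. The result is that even when I run it several times against the same file, the lines are always appended to the destination file.

Edit: I have tried the code from the answer. It is still slow. Even if I minimize IO, I still need O(n^2) comparisons, which hurts especially once you have more than 1,000 lines. I average 10,000 lines per file.

Is there any other way to remove the duplicates?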

score 2 · Accepted Answer

Here is a short version that does minimal IO and cleans up after itself.

import difflib

sourcename = 'xiaoshanwujzw'
destname = 'bindresult'

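# 'a+' keeps the existing contents and appends new writes; 'w+' would
# truncate the file on open, leaving nothing to read back.
with open('%s.txt' % destname, 'a+') as destfile:

  # we read in the file so that on subsequent runs of this script, we
  # won't duplicate the lines. 'a+' may start at end of file, so rewind.
  destfile.seek(0)
  known_lines = set(destfile.readlines())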

  with open('%s.txt' % sourcename) as sourcefile:
    for line in sourcefile:
      similar = False
      for known in known_lines:
        ratio = difflib.SequenceMatcher(None, line, known).ratio()
        if ratio > 0.8:
          print(ratio)
          print(line)
          print(known)
          similar = True
          break
      if not similar:
        destfile.write(line)
        known_lines.add(line)

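Instead of reading the known lines from the file every time, we keep them in a set and compare against that. The set is essentially a mirror of the contents of 'destfile'.

A note on complexity

By its nature, this problem has O(n²) complexity. Because you are looking for similarity to known strings rather than identical strings, you have to look at every previously seen string. If you wanted to remove exact duplicates instead of fuzzy matches, you could use a simple set lookup, which is O(1), making the whole solution O(n).

There may be a way to reduce the fundamental complexity by applying lossy compression to the strings, so that two similar strings compress to the same result. However, that is beyond both the scope of a Stack Overflow answer and my expertise. It is an active research area, so you may have some luck digging through the literature.

You could also reduce the time taken by ratio() by using the less accurate alternatives quick_ratio() and real_quick_ratio().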
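For the exact-duplicate case, a minimal sketch might look like the following. The function name dedupe_exact is made up for illustration; it keeps the first occurrence of each line and preserves the original order.

```python
def dedupe_exact(lines):
    """Return lines in their original order with exact duplicates removed."""
    seen = set()
    unique = []
    for line in lines:
        if line not in seen:  # set membership is O(1) on average
            seen.add(line)
            unique.append(line)
    return unique


if __name__ == '__main__':
    sample = ['this is data\n', 'and more data\n', 'this is data\n']
    for kept in dedupe_exact(sample):
        print(kept.rstrip('\n'))
```

Each line is hashed once, so the whole pass is linear in the number of lines, at the cost of holding every distinct line in memory.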
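As a rough illustration of that idea (not something the answer itself provides), one could reduce each line to a lossy key and deduplicate on the key with a set. The normalize function below is a hypothetical choice: it lowercases the line and drops everything except letters and digits. It only catches lines that differ in case, spacing, or punctuation, so it is not equivalent to the 0.8 ratio threshold.

```python
import re


def normalize(line):
    """Hypothetical lossy key: lowercase, keep only letters and digits."""
    return re.sub(r'[\W_]+', '', line.lower())


def dedupe_by_key(lines, key=normalize):
    """Keep the first line seen for each key, preserving order."""
    seen = set()
    unique = []
    for line in lines:
        k = key(line)
        if k not in seen:
            seen.add(k)
            unique.append(line)
    return unique


if __name__ == '__main__':
    sample = ['This is data\n', 'this  is DATA!\n', 'and more data\n']
    for kept in dedupe_by_key(sample):
        print(kept.rstrip('\n'))
```

Whether such a key is good enough depends on how the crawled lines actually differ; a different key function can be passed in through the key parameter.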
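One way to use these two methods, going slightly beyond what the answer says, is as cheap pre-filters. The difflib documentation describes quick_ratio() and real_quick_ratio() as upper bounds on ratio(), so a pair that fails either bound cannot pass the full check. The helper name is_similar and its threshold parameter are assumptions for this sketch.

```python
from difflib import SequenceMatcher


def is_similar(a, b, threshold=0.8):
    """True if ratio() exceeds threshold, trying the cheap upper bounds first."""
    matcher = SequenceMatcher(None, a, b)
    # 'and' short-circuits, so ratio() only runs when both bounds pass.
    return (matcher.real_quick_ratio() > threshold
            and matcher.quick_ratio() > threshold
            and matcher.ratio() > threshold)


if __name__ == '__main__':
    print(is_similar('this is data\n', 'this is data\n'))
    print(is_similar('radically different thing\n', 'a website line\n'))
```

This does not change the O(n²) number of comparisons; it only makes the clearly dissimilar pairs cheaper to reject.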

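score 0 · Answer

Your code works fine for me. When lines are similar (in the example I used, exactly the same), it prints destline and sourceline to standard output, but it only writes each unique line to the file once. You may need to set the ratio threshold lower for your particular "similarity" needs.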

score 0 · Answer

Basically, what you need to do is check every line in the source file to see whether it has a potential match with any line of the destination file.

##xiaoshanwujzw.txt
##-----------------
##radically different thing
##this is data
##and more data

##bindresult.txt
##--------------
##a website line
##this is data
##and more data

from difflib import SequenceMatcher

sourcefile = open('xiaoshanwujzw.txt', 'r')
sourcelines = sourcefile.readlines()
sourcefile.close()

destfile = open('bindresult.txt', 'a+')
destfile.seek(0)  # 'a+' may open positioned at end of file; rewind before reading
destlines = destfile.readlines()


has_matches = {k: False for k in sourcelines}

for d_line in destlines:

    for s_line in sourcelines:

        if SequenceMatcher(None, d_line, s_line).ratio() > 0.8:
            has_matches[s_line] = True
            break

for k in has_matches:
    if not has_matches[k]:
        destfile.write(k)

destfile.close()

This appends the line "radically different thing" to the destination file.
