python - How do I remove duplicate entries from an output file in Python?

Question

I am very new to Python. I am trying to extract data from a text file in the following format:

85729 block addressing indices approximate text retrieval

85730 automatic query expansion based divergence etc...

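The output text file should be a list of words with no duplicate entries. The input text file may contain duplicates. The output would look like this:

block

addressing

indices

approximate

etc....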

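So far, my code produces the list of words, but it still contains duplicates. I try to check for duplicates before writing each word to the output file, but the output does not reflect that check. Any suggestions? My code: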

infile = open("paper.txt", 'r')
outfile = open("vocab.txt", 'r+a')
lines = infile.readlines()
for i in lines:
   thisline = i.split()
   for word in thisline:
       digit = word.isdigit()
       found = False
       for line in outfile:
            if word in line:
                found = True
                break  
       if (digit == False) and (found == False ):   
                    outfile.write(word);
                    outfile.write("\n");

I also don't understand how to close a for loop in Python. In C++ or Java, curly braces define the body of a for loop, but I'm not sure how this is done in Python. Can anyone help?

score 1 · Accepted Answer

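Python loops are closed by dedenting; the whitespace on the left is semantically significant. This saves you from typing curly braces, do/od, or anything similar, and it eliminates a whole class of bugs in which the indentation accidentally fails to reflect the actual control flow.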
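To make the indentation rule concrete, here is a minimal sketch. The function name collect_words and the sample lines are made up for illustration; only the two-level loop shape mirrors the code in the question.

```python
def collect_words(lines):
    """Return the non-numeric words found in an iterable of text lines."""
    words = []
    for line in lines:                 # body of the outer loop is indented one level
        for word in line.split():      # body of the inner loop is indented two levels
            if not word.isdigit():
                words.append(word)     # innermost statement
        # Dedented to the outer loop's level: the inner loop has ended here.
    # Dedented to the function's level: the outer loop has ended here.
    return words


# Hypothetical sample input in the same "id word word ..." shape as the question.
sample = ["85729 block addressing", "85730 automatic query"]
print(collect_words(sample))  # ['block', 'addressing', 'automatic', 'query']
```

A statement belongs to the nearest enclosing block whose indentation it matches, so moving a line left by one level is what "closes" the loop above it.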

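Your input doesn't look large enough to justify looping over the output file (and if it were, I'd probably use a gdbm table anyway), so you can probably do something like this (only very lightly tested):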

#!/usr/local/cpython-3.3/bin/python

with open('/etc/crontab', 'r') as infile, open('output.txt', 'w') as outfile:
    seen = set()
    for line in infile:
        for word in line.split():
            if word not in seen:
                seen.add(word)
                outfile.write('{}\n'.format(word))
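The snippet above keeps every token, including the numeric IDs. A possible adaptation to the question's requirements might look like the sketch below. The helper names unique_words and write_vocab are hypothetical, and the file names paper.txt and vocab.txt are taken from the question. It relies on exact set membership rather than the question's `word in line` test, which is a substring match on a string and would, for example, skip "text" once "context" had been written.

```python
def unique_words(lines):
    """Return non-numeric words in first-seen order, without duplicates."""
    seen = set()
    ordered = []
    for line in lines:
        for word in line.split():
            # Skip pure-digit tokens (the leading IDs) and words already recorded.
            if word.isdigit() or word in seen:
                continue
            seen.add(word)
            ordered.append(word)
    return ordered


def write_vocab(in_path, out_path):
    """Read in_path and write one unique word per line to out_path."""
    # Mode 'w' overwrites the output on each run; the result never has to be
    # re-read because the set already remembers what was written.
    with open(in_path, 'r') as infile, open(out_path, 'w') as outfile:
        for word in unique_words(infile):
            outfile.write(word + '\n')


# Quick check on in-memory lines, so no files are needed to try it out.
sample = ["85729 block addressing indices", "85730 block indices approximate"]
print(unique_words(sample))  # ['block', 'addressing', 'indices', 'approximate']
```

With the question's files this would be called as write_vocab("paper.txt", "vocab.txt"). Whether words should also be lower-cased or stripped of punctuation before comparison is not stated in the question, so the sketch leaves them untouched.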
