
Given a very large text file, I want to remove all words that occur only once in the file. Is there a simple and efficient way to do this?

Regards,


3 Answers


You have to make 2 passes over the file (a sketch follows after the complexity notes below):

In pass 1:

  • build a dictionary with the words as keys and their occurrence counts as values (i.e. every time you read a word, add 1 to its value in the dictionary)
  • then post-process the dictionary to remove every key whose value is greater than 1. What remains is your "blacklist"

In pass 2:

  • read through the file again and delete any word that matches the blacklist.

Running time:

  • Linear time to read the file in both passes.
  • O(1) time in pass 1 to add each word to the dictionary / increment its count.
  • O(n) to post-process the dictionary into the blacklist.
  • O(1) blacklist lookups in pass 2.

O(n) complexity overall.
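
A minimal Python sketch of this two-pass approach, assuming whitespace tokenization and that the filtered text goes to a separate output file (the function and file names are illustrative, not part of the answer above):

from collections import Counter

def remove_singletons(in_path, out_path):
    # Pass 1: count how often every word occurs.
    counts = Counter()
    with open(in_path) as f:
        for line in f:
            counts.update(line.split())

    # Words seen exactly once form the "blacklist".
    blacklist = {word for word, n in counts.items() if n == 1}

    # Pass 2: rewrite the file, dropping blacklisted words.
    with open(in_path) as f, open(out_path, "w") as out:
        for line in f:
            kept = [w for w in line.split() if w not in blacklist]
            out.write(" ".join(kept) + "\n")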

answered 2012-10-17T20:33:29.553

2 passes through the file are definitely necessary. However, if the rare words are truly rare, then you can skip tokenizing large sections of the file on the second pass. First do a word-by-word pass through the file and build a dictionary that maps each word to the position where it was found if it was encountered once, or to a placeholder value if it was encountered more than once.

MULTI_WORD = -1          # placeholder for words seen more than once
word_locations = {}      # word -> position of its only occurrence

for pos, word in tokenize(input_file):
    if word not in word_locations:
        word_locations[word] = pos
    else:
        word_locations[word] = MULTI_WORD
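
The tokenize helper used above is assumed rather than defined in the answer; a minimal sketch of what it might look like, yielding (offset, word) pairs for whitespace-delimited words, and assuming ASCII/single-byte text so that string offsets line up with the file offsets used by the seeks below:

import re

def tokenize(f):
    # Hypothetical helper: yield (offset, word) for every whitespace-
    # delimited word. Reads the whole file into memory, which may not be
    # acceptable for a very large file; a streaming version would track
    # offsets chunk by chunk instead.
    data = f.read()
    for match in re.finditer(r"\S+", data):
        yield match.start(), match.group()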

Then you can extract the positions that need edits (sorted so they can be applied in file order) and do a plain copy of everything in between:

edit_points = sorted((pos, len(word)) for word, pos in word_locations.items()
                     if pos != MULTI_WORD)

start_pos = 0
for word_pos, edit_length in edit_points:
    # Copy everything up to the singleton word, then skip over the word.
    input_file.seek(start_pos)
    output_file.write(input_file.read(word_pos - start_pos))
    start_pos = word_pos + edit_length
# Copy the remainder of the file after the last edit point.
input_file.seek(start_pos)
output_file.write(input_file.read())

You might want a couple more optimizations, such as a block-wise copy procedure (see the sketch below) to limit memory overhead, and a special case for when there are no edit points.
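
For the block-wise copy, a small helper along these lines could replace the plain read/write calls above (a sketch; the name copy_bytes and the buffer size are illustrative):

def copy_bytes(src, dst, nbytes, bufsize=1 << 20):
    # Copy nbytes from src to dst in fixed-size chunks so a long span
    # between edit points never has to be held in memory all at once.
    remaining = nbytes
    while remaining > 0:
        chunk = src.read(min(bufsize, remaining))
        if not chunk:
            break
        dst.write(chunk)
        remaining -= len(chunk)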

answered 2012-10-17T21:03:34.610

It's hard to know without specific code to refer to, but a good starting point might be Python's Natural Language Toolkit (NLTK).
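
For instance, NLTK's FreqDist has a hapaxes() method that returns the words occurring exactly once (a sketch, assuming NLTK and its tokenizer data are installed; input.txt is illustrative, and for a truly huge file you would still want to process it in a streaming fashion rather than read it all at once):

import nltk

with open("input.txt") as f:
    words = nltk.word_tokenize(f.read())

freq = nltk.FreqDist(words)
singletons = set(freq.hapaxes())   # words that occur exactly once

filtered = " ".join(w for w in words if w not in singletons)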

answered 2012-10-17T20:28:24.667