Given a very large text file, I want to delete all words that occur only once in the file. Is there a simple and efficient way to do this?
Regards,
You have to make 2 passes through the file:
In pass 1: build a dictionary mapping each word to the number of times it occurs.
In pass 2: write out every word whose count is greater than 1, skipping the words that occurred only once.
Runtime: O(n) complexity.
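A minimal sketch of that two-pass approach, assuming whitespace-separated words and that it is acceptable to normalize whitespace on output (punctuation handling and tokenization details are left out):

from collections import Counter

def remove_singletons(in_path, out_path):
    # Pass 1: count how often each word occurs.
    counts = Counter()
    with open(in_path) as f:
        for line in f:
            counts.update(line.split())

    # Pass 2: write back only the words seen more than once.
    # Note: joining with single spaces normalizes the original whitespace.
    with open(in_path) as f, open(out_path, "w") as out:
        for line in f:
            kept = [w for w in line.split() if counts[w] > 1]
            out.write(" ".join(kept) + "\n")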
Two passes through the file are definitely necessary. However, if the rare words are truly rare, then you can skip tokenizing large sections of the file on the second pass. First do a word-by-word pass through the file and build a dictionary that contains the found location for words encountered once, or a placeholder value for words encountered more than once.
MULTI_WORD = -1  # sentinel: the word was seen more than once
word_locations = {}
for pos, word in tokenize(input_file):
    if word not in word_locations:
        # First sighting: remember the byte offset of this word.
        word_locations[word] = pos
    else:
        # Seen again: mark it so it will not be edited out later.
        word_locations[word] = MULTI_WORD
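The tokenize helper above is not spelled out in the answer; a plausible version, assuming words are runs of word characters and the file uses a single-byte encoding (so character offsets equal byte offsets), might look like this. For simplicity it reads the whole file at once:

import re

def tokenize(f):
    # Yield (offset, word) pairs for every word in the file.
    f.seek(0)
    for match in re.finditer(r"\w+", f.read()):
        yield match.start(), match.group()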
Then you can filter out the positions where you need to do edits and do a plain copy on the rest:
# Sort by position so the copy loop below can walk the file sequentially.
edit_points = sorted((pos, len(word)) for word, pos in word_locations.items()
                     if pos != MULTI_WORD)
start_pos = 0
for end_pos, edit_length in edit_points:
    # Copy everything up to the singleton word, then skip over it.
    input_file.seek(start_pos)
    output_file.write(input_file.read(end_pos - start_pos))
    start_pos = end_pos + edit_length
# Copy the remainder of the file after the last edit point.
input_file.seek(start_pos)
output_file.write(input_file.read())
You might want a couple more optimizations, like a block-wise copy procedure to save on memory overhead and a special case for when there are no edit points; a sketch of the former follows.
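A minimal sketch of such a block-wise copy, with a hypothetical helper copy_range that would replace each large read in the loop above (the block size is an arbitrary choice):

BLOCK_SIZE = 64 * 1024  # arbitrary; tune for your system

def copy_range(src, dst, start, end):
    # Copy bytes [start, end) from src to dst in fixed-size blocks,
    # so the whole gap never has to fit in memory at once.
    src.seek(start)
    remaining = end - start
    while remaining > 0:
        chunk = src.read(min(BLOCK_SIZE, remaining))
        if not chunk:
            break
        dst.write(chunk)
        remaining -= len(chunk)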
It's hard to say without concrete code to refer to, but a good starting point might be Python's Natural Language Toolkit (NLTK).
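As a rough illustration of what NLTK offers for this problem, assuming the file fits in memory and the punkt tokenizer data is installed: FreqDist.hapaxes() returns exactly the words that occur once.

import nltk

with open("input.txt") as f:
    tokens = nltk.word_tokenize(f.read())

# hapaxes() lists the words that occur exactly once in the text.
singletons = set(nltk.FreqDist(tokens).hapaxes())
kept = [t for t in tokens if t not in singletons]
print(" ".join(kept))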