I am new to Python. I have a word list and a very large file, and I would like to delete every line of the file that contains a word from the word list.
The word list is given sorted and can be fed in during initialization. I am trying to find the best way to solve this problem. Right now I am doing a linear search and it is taking too much time.
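For reference, this is roughly what my current linear search looks like (a sketch; the file name and variable names are illustrative):

# Naive scan (illustrative): test every word of the list against every line.
kept_lines = []
with open('big_file.txt') as f:
    for line in f:
        if not any(word in line for word in word_list):
            kept_lines.append(line)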
Any suggestions?
You can use intersection from set theory to check whether the word list and the words of a line have anything in common.
list_of_words = []
sett = set(list_of_words)
with open(inputfile) as f1, open(outputfile, 'w') as f2:
    for line in f1:
        if len(set(line.split()).intersection(sett)) >= 1:
            pass
        else:
            f2.write(line)
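To illustrate the intersection test in isolation (assumed sample data, not from the question):

sett = {'foo', 'bar'}
line = 'some foo text'
# Non-empty intersection means the line contains a forbidden word and is skipped.
print(set(line.split()).intersection(sett))  # {'foo'}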
If the source file contains only words separated by whitespace, you can use sets:
words = set(your_words_list)
for line in infile:
    if words.isdisjoint(line.split()):
        outfile.write(line)
Note that this doesn't handle punctuation, e.g. given words = ['foo', 'bar'], a line like foo, bar,stuff won't be removed. To handle this, you need regular expressions:
import re

rr = r'\b(%s)\b' % '|'.join(your_words_list)
for line in infile:
    if not re.search(rr, line):
        outfile.write(line)
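For a very large file it may be worth escaping the words and compiling the pattern once, rather than letting it be re-parsed per line (a sketch under the same assumptions as above, i.e. your_words_list, infile and outfile are already defined):

import re

# re.escape guards against words containing regex metacharacters,
# and compiling the alternation once avoids repeated pattern parsing.
rr = re.compile(r'\b(%s)\b' % '|'.join(map(re.escape, your_words_list)))
for line in infile:
    if not rr.search(line):
        outfile.write(line)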
You cannot delete the lines in-place; you need to rewrite a second file. You may overwrite the old one afterwards (see shutil.copy for this).
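A minimal sketch of that overwrite step, assuming outputfile holds the filtered result and should replace inputfile:

import shutil

# Replace the original file with the filtered copy once filtering is done.
shutil.copy(outputfile, inputfile)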
The rest reads like pseudo-code:
forbidden_words = set("these words shall not occur".split())
with open(inputfile) as infile, open(outputfile, 'w+') as outfile:
    outfile.writelines(line for line in infile
                       if not any(word in forbidden_words for word in line.split()))
See this question for approaches to getting rid of punctuation-induced false negatives; one such approach is sketched below.
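A sketch of one such approach, assuming Python 3 (the tokens helper is hypothetical, not part of the answer above): strip punctuation from each token before the membership test.

import string

# Translation table that deletes all ASCII punctuation characters.
table = str.maketrans('', '', string.punctuation)

def tokens(line):
    # Hypothetical helper: yield each word with punctuation stripped,
    # so that "foo," still matches the forbidden word "foo".
    return (word.translate(table) for word in line.split())

# In the filter above, replace line.split() with tokens(line).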
The lines and words in the big file would need to be sorted in some way for you to implement a binary search. They don't appear to be, so the best you can do is a linear search, checking whether each word in your list occurs in a given line.
import re

contents = file.read()
words = sorted(the_list, key=len, reverse=True)  # sorted() returns the new list; list.sort() returns None
pattern = r'^.*(%s).*\n' % '|'.join(map(re.escape, words))
stripped_contents = re.sub(pattern, '', contents, flags=re.MULTILINE)
Something like that should work... not sure whether it would be faster than going line by line.
[Edit] This is untested code and may need some minor tweaking.
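For completeness, since the question says the word list is given sorted, a per-line binary search against that list could look like this (a sketch using the bisect module; contains_forbidden is a hypothetical helper):

import bisect

def contains_forbidden(line, sorted_words):
    # Binary-search each token of the line against the sorted word list.
    for word in line.split():
        i = bisect.bisect_left(sorted_words, word)
        if i < len(sorted_words) and sorted_words[i] == word:
            return True
    return False

In practice a set membership test is O(1) and simpler, as the other answers show, so this mainly illustrates the binary-search idea.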