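I have a program that extracts text around specific keywords. I am trying to modify it so that if two keywords are close enough together, it shows one longer snippet instead of two separate ones.
My current code is below. It appends the words that follow a keyword to a list and resets the counter if another keyword is found. However, I have run into two problems. The first is that I exceed the data rate limit in my Spyder notebook, and I have not been able to get around it. The second is that although this produces longer snippets, it does not eliminate the duplicates.
Does anyone know a way to get rid of the duplicate snippets, or how to merge snippets without exceeding the data rate limit (or how to change the Spyder rate limit)? Thanks!!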
    import os

    def occurs(word1, word2, file, filewrite):
        # open the file, read it, and split it into lines
        with open(file, 'r') as infile:
            lines = infile.read().splitlines()
        wordlist = [word1, word2]  # this list allows for multiple words
        # join with a space so the last word of one line does not merge with the next
        words = ' '.join(lines).split()  # split the file into individual words

        # note: opening `file` with 'w' overwrites the input file with the snippets
        with open(file, 'w') as f, open(filewrite, 'w') as g:
            f.write("start")
            f.write(os.linesep)
            g.write("start")
            g.write(os.linesep)
            for item in wordlist:  # multiple words
                # indices of every word that contains the keyword
                matches = [i for i, w in enumerate(words) if w.lower().find(item) != -1]
                for m in matches:  # one snippet per match, with the surrounding words
                    after = []  # words following the keyword (was named `list`, which shadows the builtin)
                    # up to 20 words before the keyword, plus the keyword itself
                    before = " ".join(words[max(0, m - 20):m + 1])
                    offset = 1  # position relative to the keyword
                    j = 0       # words collected since the last keyword
                    while j < 20 and m + offset < len(words):
                        after.append(words[m + offset])
                        j = j + 1
                        if words[m + offset] == word1 or words[m + offset] == word2:
                            j = 0  # another keyword found: extend the snippet
                        offset = offset + 1
                        print(after)  # debug output, printed on every iteration
                    k = " ".join(after)
                    f.write(f"...{before} {k}...")  # writes the data to the external file
                    f.write(os.linesep)
                    g.write(str(m))
                    g.write(os.linesep)
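
For reference, the behaviour I am after (one longer snippet when keywords are close together, and no duplicates) could be sketched as interval merging. This is only a minimal sketch, not working code from my program: `merge_windows` and `find_snippets` are hypothetical helper names, and the 20-word window on each side is taken from the code above.

```python
def merge_windows(positions, before=20, after=20, n_words=None):
    """Turn keyword positions into non-overlapping (start, end) word ranges.

    `end` is exclusive. Windows that overlap or touch are merged into one.
    (Hypothetical helper, not part of the original program.)
    """
    windows = []
    for pos in sorted(set(positions)):  # set() drops repeated positions
        start = max(0, pos - before)
        end = pos + after + 1
        if n_words is not None:
            end = min(end, n_words)
        if windows and start <= windows[-1][1]:
            # Close enough to the previous window: extend it instead of adding a new one
            windows[-1][1] = max(windows[-1][1], end)
        else:
            windows.append([start, end])
    return [tuple(w) for w in windows]


def find_snippets(words, keywords, before=20, after=20):
    """Return one snippet per merged window around any of the keywords."""
    keys = [k.lower() for k in keywords]
    # Collect the positions for ALL keywords first, so each position is handled once
    positions = [i for i, w in enumerate(words) if any(k in w.lower() for k in keys)]
    ranges = merge_windows(positions, before, after, len(words))
    return [" ".join(words[s:e]) for s, e in ranges]


if __name__ == "__main__":
    text = "a b c cat d e dog f g h i j k l cat m"
    for snippet in find_snippets(text.split(), ["cat", "dog"], before=1, after=1):
        print(f"...{snippet}...")
```

Because the positions of all keywords are gathered before any snippet is built, a stretch of text containing both keywords would end up in a single window rather than being written once per keyword, and nothing needs to be printed inside the loop.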