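I am trying to take a regular text file and remove the words listed in a separate stop-words file. That file contains the words to remove, one per line (separated by newlines, "\n").
Right now I convert both files to lists so that I can compare their elements. The function runs, but it does not remove every word I specified in the stop-words file. Any help is greatly appreciated.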
    def elimstops(file_str): #takes as input a string for the stopwords file location
        stop_f = open(file_str, 'r')
        stopw = stop_f.read()
        stopw = stopw.split('\n')
        text_file = open('sample.txt') #Opens the file whose stop words will be eliminated
        prime = text_file.read()
        prime = prime.split(' ') #Splits the string into a list separated by a space
        tot_str = "" #total string
        i = 0
        while i < (len(stopw)):
            if stopw[i] in prime:
                prime.remove(stopw[i]) #removes the stopword from the text
            else:
                pass
            i += 1
        # Creates a new string from the compilation of list elements
        # with the stop words removed
        for v in prime:
            tot_str = tot_str + str(v) + " "
        return tot_str
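
A likely cause, offered as a sketch rather than a confirmed diagnosis: `list.remove` deletes only the first matching element, so a stop word that appears several times in the text survives after its first occurrence is removed. Splitting on a single space can also leave words glued to newlines (for example "the\ncat"), which then never match. The version below is a minimal sketch under those assumptions; the name `remove_stopwords` and its parameters are hypothetical, and it takes strings and lists rather than file paths so it can be run on its own. It does not handle case differences or punctuation.

```python
def remove_stopwords(text, stopwords):
    """Return text with every whitespace-separated token found in stopwords removed."""
    # A set gives fast membership tests; blank entries (e.g. from a
    # trailing newline in the stop-words file) are dropped.
    stop_set = {w.strip() for w in stopwords if w.strip()}
    # split() with no argument splits on any run of whitespace,
    # including newlines, unlike split(' ').
    kept = [tok for tok in text.split() if tok not in stop_set]
    return " ".join(kept)


# list.remove only deletes the first match:
words = "the cat and the dog".split(" ")
words.remove("the")
print(words)  # ['cat', 'and', 'the', 'dog']

# Filtering every token removes all occurrences:
sample = "the cat and the dog sat on the mat"
print(remove_stopwords(sample, ["the", "and", ""]))  # cat dog sat on mat
```

To use it with files, read the stop-words file with `read().split('\n')` as in the original function and pass the resulting list, together with the contents of `sample.txt`, to `remove_stopwords`.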