
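I'm trying to take a regular text file and remove the words listed in a separate file (the stopwords file), which contains the words to be removed, separated by newlines ("\n").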

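Right now I'm converting both files into lists so that the elements of each list can be compared. I got this function to run, but it doesn't remove all of the words I specified in the stopwords file. Any help is greatly appreciated.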

def elimstops(file_str): #takes as input a string for the stopwords file location
  stop_f = open(file_str, 'r')
  stopw = stop_f.read()
  stopw = stopw.split('\n')
  text_file = open('sample.txt') #Opens the file whose stop words will be eliminated
  prime = text_file.read()
  prime = prime.split(' ') #Splits the string into a list separated by a space
  tot_str = "" #total string
  i = 0
  while i < (len(stopw)):
    if stopw[i] in prime:
      prime.remove(stopw[i]) #removes the stopword from the text
    else:
      pass
    i += 1
  # Creates a new string from the compilation of list elements 
  # with the stop words removed
  for v in prime:
    tot_str = tot_str + str(v) + " " 
  return tot_str

3 Answers


Here is an alternative solution using a generator expression.

tot_str = ' '.join(word for word in prime if word not in stopw)

For efficiency, convert stopw to a set with stopw = set(stopw), so that each membership test takes constant time on average.

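Your current approach may run into problems if sample.txt is not just a space-delimited file. For example, if it contains normal sentences with punctuation, splitting on spaces will leave the punctuation attached to the words. To fix this, you can use the re module to split the string on whitespace or punctuation: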

import re
prime = re.split(r'\W+', text_file.read())
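
Putting the pieces above together, a minimal sketch might look like the following. The function name remove_stopwords and its string arguments are assumptions for illustration; it takes text rather than file paths so it can be tried without any files. Note that the comparison is case-sensitive, so "The" is not treated as the stop word "the".

```python
import re

def remove_stopwords(text, stopword_text):
    """Return text with every stop word removed (hypothetical helper)."""
    # One stop word per line; a set gives O(1) average membership tests.
    stopw = set(stopword_text.split('\n'))
    # Split on runs of non-word characters so punctuation is not glued to words.
    prime = re.split(r'\W+', text)
    # Drop stop words and the empty strings re.split can leave at the edges.
    return ' '.join(word for word in prime if word and word not in stopw)

print(remove_stopwords("The cat, the dog and the bird.", "the\nand"))
# prints: The cat dog bird
```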
answered 2012-10-22T16:49:14.867

I think your problem is that this code:

    if stopw[i] in prime:
      prime.remove(stopw[i]) #removes the stopword from the text

will only remove the first occurrence of stopw[i] from prime. To fix this, you should do:

    while stopw[i] in prime:
      prime.remove(stopw[i]) #removes the stopword from the text

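However, this will run very slowly, because both the in prime test and the prime.remove call have to scan through prime. This means you end up with a running time that is quadratic in the length of your text. If you use a generator expression as FJ suggests (with stopw converted to a set), your running time will be linear, which is much better.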
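
To make the difference concrete, here is a small sketch with made-up data (the lists below are illustrative, not the asker's files). It shows that list.remove drops only the first match, while a single filtering pass against a set removes every occurrence.

```python
prime = "the cat and the dog and the bird".split(' ')
stopw = ["the", "and"]

# Original approach: one remove() per stop word drops only the first match.
once = list(prime)
for w in stopw:
    if w in once:
        once.remove(w)

# Single pass with a set: each word is checked once, so repeats go too.
stop_set = set(stopw)
filtered = [word for word in prime if word not in stop_set]

print(once)      # ['cat', 'the', 'dog', 'and', 'the', 'bird']
print(filtered)  # ['cat', 'dog', 'bird']
```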

answered 2012-10-22T16:58:40.600

I don't know Python, but here is a general way to do it that runs in O(n) + O(m) time, i.e. linear.

1: Add all the words from the stopwords file to a map.
2: Read your regular text file word by word. For each word, check whether it is in the map: if it is, skip it; otherwise, add it to a list.

At the end, the list should contain all the words that you need, that is, the original words minus the ones you wanted removed.
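
In Python, the two steps above could be sketched roughly as follows, using a set in place of the map. The function name filter_words and the use of io.StringIO to stand in for the two opened files are assumptions made so the example runs on its own.

```python
import io

def filter_words(stop_file, text_file):
    """Hypothetical helper: return the words of text_file not in stop_file."""
    # Step 1: load every stop word into a set (one word per line).
    stop_set = {line.strip() for line in stop_file if line.strip()}
    # Step 2: read the text word by word, keeping only non-stop words.
    kept = []
    for line in text_file:
        for word in line.split():
            if word not in stop_set:
                kept.append(word)
    return kept

# io.StringIO objects stand in for the opened stopwords and text files.
stops = io.StringIO("the\nand\n")
text = io.StringIO("the cat and the dog\nand the bird\n")
print(filter_words(stops, text))  # ['cat', 'dog', 'bird']
```

Each file is read once and each lookup in the set takes constant time on average, which is where the linear bound comes from.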

answered 2012-10-22T16:57:07.833