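I am trying to read a text file and remove all stop words from it. However, I get an "index out of range" error when I use b[i].pop(j). If I use print(b[i][j]) instead, I get no error and the stop words are printed as output. Can anyone spot the mistake?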

import nltk
from nltk.corpus import stopwords
stop = stopwords.words('english')

fo = open("text.txt", "r")
# text.txt is just a text document

list = fo.read();
list = list.replace("\n","")
# removing newline character

b = list.split('.',list.count('.'))
# splitting list into lines

for i in range (len(b) - 1) :
    b[i] = b[i].split()
# splitting each line into words

for i in range (0,len(b))   :
    for j in range (0,len(b[i]))    :
        if b[i][j] in stop  :
            b[i].pop(j)
#           print(b[i][j])
#print(b)

# Close the opened file
fo.close()

Output:

Traceback (most recent call last):
  File "prog.py", line 29, in <module>
    if b[i][j] in stop  :
IndexError: list index out of range

Output after commenting out b[i].pop(j) and uncommenting print(b[i][j]):

is
that
the
from
the
the
the
can
the
and
and
the
is
and
can
be
into
is
a
or
1 Answer

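You are removing elements from the lists while iterating over them. Each pop shrinks the list, but range(0, len(b[i])) was computed from the original length, so the loop eventually asks for an index that no longer exists. That is what raises the IndexError. With print(b[i][j]) nothing is removed, so every index stays valid.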
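
To make the failure concrete, here is a minimal sketch that reproduces it on a small hand-made list. The names words, stop and remove_in_place are hypothetical and only for illustration; stop is a tiny stand-in for the NLTK stop word list.

```python
words = ["this", "is", "the", "text"]
stop = {"is", "the"}  # hypothetical stand-in for stopwords.words('english')


def remove_in_place(words, stop):
    # range(len(words)) is evaluated once, before anything is removed,
    # so j keeps counting up to the ORIGINAL length of the list.
    for j in range(len(words)):
        if words[j] in stop:
            words.pop(j)  # the list shrinks; later indexes shift left
    return words


try:
    remove_in_place(list(words), stop)
    raised = False
except IndexError:
    raised = True

print(raised)
```

Note that after "is" is popped at index 1, "the" slides into index 1 and is never examined, so popping inside an index loop can also skip elements, not only run off the end.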

Instead, you should build a new list that contains only the elements you want to keep. Example:

result = []
for i in range (0,len(b)):
    templist = []
    for j in range (0,len(b[i])):
        if b[i][j] not in stop :
            templist.append(b[i][j])
    result.append(templist)

The same can be done with a list comprehension:

result = [[word for word in sentence if word not in stop] for sentence in b]
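
For completeness, the whole pipeline could be sketched roughly as below. This is only one possible arrangement, not the asker's exact code: remove_stop_words and sample_stop are hypothetical names, and sample_stop stands in for stopwords.words('english') so the snippet runs without NLTK. It also differs from the question in two labeled ways: newlines are replaced with a space rather than an empty string (so words on adjacent lines are not glued together), and empty sentences are dropped.

```python
def remove_stop_words(text, stop_words):
    """Split text into sentences on '.', then drop stop words from each."""
    stop_set = set(stop_words)  # set lookup is faster than scanning a list
    # Assumption: replace newlines with a space, unlike the question's "".
    sentences = [s.split() for s in text.replace("\n", " ").split(".")]
    # Build new lists instead of popping from the ones being iterated.
    return [
        [word for word in sentence if word not in stop_set]
        for sentence in sentences
        if sentence  # skip the empty piece left after the final '.'
    ]


# Hypothetical stand-in for stopwords.words('english')
sample_stop = ["is", "the", "a", "and"]
result = remove_stop_words("This is the first line.\nAnd a second one.", sample_stop)
print(result)
```

Matching here is case-sensitive, as in the original code, so "And" is kept while "and" would be removed; lowercasing each word before the membership test is one way to change that.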
answered 2015-10-19T10:16:52.520