I have the following input data. I want to remove the stop words from it and tokenize it:
input = [['Hi i am going to college', 'We will meet next time possible'],
         ['My college name is jntu', 'I am into machine learning specialization'],
         ['Machine learnin is my favorite subject', 'Here i am using python for implementation']]
I tried the following code, but it does not give the result I want:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(input)
filtered_sentence = [w for w in word_tokens if not w in stop_words]
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)
#print(word_tokens)
print(filtered_sentence)
The expected output is:
Output = [['Hi', 'going', 'college', 'meet', 'next', 'time', 'possible'],
          ['college', 'name', 'jntu', 'machine', 'learning', 'specialization'],
          ['Machine', 'learnin', 'favorite', 'subject', 'using', 'python', 'implementation']]
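
For reference, here is a minimal sketch of one way to get that shape of output. The attempt above passes the whole nested list to word_tokenize, which expects a single string, so the sketch loops over each inner list and each sentence instead. It also compares tokens in lower case, because the expected output drops 'We', 'My' and 'Here' and the NLTK English stop-word list is all lower case. The names remove_stopwords, demo_stop_words and data are hypothetical, chosen only for this illustration. demo_stop_words is a small hand-written subset so the sketch runs without downloading anything, and str.split stands in for the tokenizer. With NLTK you would pass set(stopwords.words('english')) and word_tokenize instead.

```python
def remove_stopwords(groups, stop_words, tokenize=str.split):
    """Tokenize every sentence in each group and drop the stop words.

    Returns one flat list of tokens per inner list of sentences.
    """
    result = []
    for group in groups:
        tokens = []
        for sentence in group:
            for token in tokenize(sentence):
                # Compare in lower case so 'We', 'My', 'Here' are removed too.
                if token.lower() not in stop_words:
                    tokens.append(token)
        result.append(tokens)
    return result


# Assumption: illustrative subset only, not the full NLTK list.
demo_stop_words = {'i', 'am', 'to', 'we', 'will', 'my', 'is', 'into', 'here', 'for'}

# Named 'data' rather than 'input' to avoid shadowing the built-in input().
data = [['Hi i am going to college', 'We will meet next time possible'],
        ['My college name is jntu', 'I am into machine learning specialization'],
        ['Machine learnin is my favorite subject', 'Here i am using python for implementation']]

for row in remove_stopwords(data, demo_stop_words):
    print(row)
```

Taking the stop-word set and the tokenizer as parameters keeps the nested-list handling separate from the NLTK-specific parts, so either can be swapped without touching the loop.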