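I'm trying to classify movie genres with NLTK's Naive Bayes classifier, but I'm getting strange results. Right now it seems to guess purely from how often each genre appears in the training input.
If I train it on two action movies and one comedy, every prediction comes back as action. Naturally, I want the prediction to be based on the text I pass in: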
import re
import nltk

# wordStop (the stop-word collection) and trainset (the list of movie
# records) are defined elsewhere.

def RemoveStopWords(wordText):
    # Drop stop words and return the remaining words, lowercased and de-duplicated.
    keep_list = []
    for word in wordText:
        if word not in wordStop:
            keep_list.append(word.lower())
    return set(keep_list)

def getFeatures(element):
    # Split the plot on runs of non-word characters.
    splitter = re.compile(r'\W+')
    f = {}
    plot = [s for s in RemoveStopWords(splitter.split(element['imdb']['plot']))
            if len(s) > 5 and len(s) < 15]
    for w in plot:
        f[w] = w
    return f

def FindFeaturesForList(MovieList):
    # Build one (features, genre) pair per genre of each movie.
    featureSet = []
    for w in MovieList:
        print(w['imdb']['title'])
        try:
            for genre in w['imdb']['genres']:
                featureSet.append((getFeatures(w), genre))
        except Exception:
            print("Error when retrieving genre, skipping element")
    return featureSet

featureList = FindFeaturesForList(trainset)
cl = nltk.NaiveBayesClassifier.train(featureList)
So whenever I run cl.classify(movie), it returns whichever genre appeared most often in the training data. What am I doing wrong?
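One thing that may be worth checking is what actually gets passed to classify. The classifier is trained on the dicts returned by getFeatures, so it presumably needs the same kind of dict at prediction time. NLTK's Naive Bayes classifier discards feature names it never saw during training, and a raw movie record only has keys such as 'imdb', so nothing would be left except the label prior. The sketch below is a toy Naive Bayes written in plain Python 3 (not NLTK's implementation; train, classify, and the sample data are hypothetical names made up for illustration) that reproduces this effect:

```python
import math
from collections import Counter, defaultdict

def train(labeled_featuresets):
    """Toy Naive Bayes: count labels and the feature names seen per label."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)  # label -> Counter of feature names
    for features, label in labeled_featuresets:
        label_counts[label] += 1
        feature_counts[label].update(features.keys())
    return label_counts, feature_counts

def classify(model, features):
    label_counts, feature_counts = model
    known = set()
    for counts in feature_counts.values():
        known.update(counts)
    # Feature names never seen in training are dropped before scoring.
    usable = [name for name in features if name in known]
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior of the label
        for name in usable:
            # Add-one smoothed estimate of P(feature present | label).
            score += math.log((feature_counts[label][name] + 1) / (n + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data: two action movies, one comedy.
train_set = [
    ({"explosion": True, "chase": True}, "Action"),
    ({"gunfight": True, "chase": True}, "Action"),
    ({"wedding": True, "prank": True}, "Comedy"),
]
model = train(train_set)

# A raw record has no known feature names, so only the prior decides.
raw_movie = {"imdb": {"plot": "A wedding prank goes wrong"}}
print(classify(model, raw_movie))

# A feature dict built the same way as the training features is scored on its words.
print(classify(model, {"wedding": True, "prank": True}))
```

If this is what is happening here, calling cl.classify(getFeatures(movie)) instead of cl.classify(movie) would be the first thing to try.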