
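I'm trying to classify movies by genre with the NLTK Naive Bayes classifier, but I'm getting strange results. At the moment it guesses based only on how many times each genre appears in the training input.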

If I feed it two action movies and one comedy, every guess comes back as action. Naturally, I want it to base its guess on the input text:

import re
import nltk

# wordStop (the stop-word collection) and trainset are defined elsewhere.

def RemoveStopWords(wordText):
    keep_list = []
    for word in wordText:
        if word not in wordStop:
            keep_list.append(word.lower())

    return set(keep_list)

def getFeatures(element):
    splitter = re.compile('\\W*')
    f = {}
    plot = [s for s in RemoveStopWords(splitter.split(element['imdb']['plot']))
            if len(s) > 5 and len(s) < 15]

    for w in plot:
        f[w] = w

    return f

def FindFeaturesForList(MovieList):
    featureSet = []
    for w in MovieList:
        print(w['imdb']['title'])
        try:
            for genre in w['imdb']['genres']:
                featureSet.append((getFeatures(w), genre))
        except:
            print("Error when retrieving genre, skipping element")

    return featureSet

featureList = FindFeaturesForList(trainset)
cl = nltk.NaiveBayesClassifier.train(featureList)

So whenever I call cl.classify(movie), it returns the genre that appears most often in the training input. What am I doing wrong?


1 Answer


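In the movie review classification example from the NLTK book, note that the frequencies of all words across all the movies are collected, and only the most common words are then chosen as feature keys. (In the NLTK 2 releases of the time, FreqDist.keys() returned the words in decreasing order of frequency, which is why slicing it yields the most common ones.)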

all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = all_words.keys()[:2000]

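I think it is important to note that this is a choice. Selecting feature keys this way is not mandatory, and some other clever feature selection might lead to a better classifier. Choosing good features is the art behind the science.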
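As a rough illustration of that selection step, here is a minimal sketch that uses collections.Counter in place of nltk.FreqDist so it runs without NLTK installed. The toy plots and the names plots, counts, top_n are assumptions for the example, not part of the original data. The length filter mirrors the one used in the question.

```python
from collections import Counter

# Hypothetical toy plots; real data would come from w['imdb']['plot'].
plots = [
    "A detective chases a killer through the city",
    "A killer robot chases a detective",
    "Two friends open a bakery in the city",
]

# Count lower-cased tokens across ALL plots, keeping only words
# longer than 5 and shorter than 15 characters (as in the question).
counts = Counter(
    word.lower()
    for plot in plots
    for word in plot.split()
    if 5 < len(word) < 15
)

# Keep the N most frequent words as the shared feature keys.
top_n = 3
word_features = [word for word, _ in counts.most_common(top_n)]
print(word_features)
```

The point is that the vocabulary is computed once over the whole training set, so every movie is later described with the same set of keys.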

Anyway, perhaps try using the same idea in your classifier:

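def getFeatures(text, word_features):
    text = text.lower()
    # One boolean per vocabulary word, so every movie gets the same keys.
    # Note: `word in text` is a substring test against the whole plot string.
    f = {word: word in text for word in word_features}
    return f


def FindFeaturesForList(MovieList):
    featureSet = []
    # '\W+' rather than '\W*': on Python 3.7+ a pattern that can match
    # the empty string would split between every character.
    splitter = re.compile(r'\W+')
    all_words = nltk.FreqDist(
        s.lower()
        for w in MovieList
        for s in RemoveStopWords(splitter.split(w['imdb']['plot']))
        if len(s) > 5 and len(s) < 15)
    # In NLTK 3, keys() is not frequency-sorted and cannot be sliced,
    # so most_common() replaces all_words.keys()[:2000].
    word_features = [word for word, _ in all_words.most_common(2000)]
    for w in MovieList:
        print(w['imdb']['title'])
        try:
            for genre in w['imdb']['genres']:
                featureSet.append(
                    (getFeatures(w['imdb']['plot'], word_features), genre))
        except:
            print("Error when retrieving genre, skipping element")

    return featureSet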
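
To see what the resulting training data looks like, here is a small self-contained sketch on made-up movies. The names tokenize, get_features, movies and the three-word vocabulary are hypothetical, and unlike the code above it tests membership against a set of tokens rather than doing a substring search, which is a deliberate variation and not something the original code does.

```python
import re

def tokenize(text):
    # Lower-case the text and return its distinct word tokens.
    return set(re.findall(r"\w+", text.lower()))

def get_features(text, word_features):
    # One boolean per vocabulary word: is the word present in this plot?
    tokens = tokenize(text)
    return {word: (word in tokens) for word in word_features}

# Hypothetical vocabulary and movies, for illustration only.
word_features = ["detective", "killer", "bakery"]
movies = [
    {"plot": "A detective chases a killer", "genres": ["Action", "Crime"]},
    {"plot": "Two friends open a bakery", "genres": ["Comedy"]},
]

# One (features, label) pair per genre of each movie.
feature_set = [
    (get_features(m["plot"], word_features), genre)
    for m in movies
    for genre in m["genres"]
]

for features, genre in feature_set:
    print(genre, features)
```

Every feature dict now has the same keys, with False marking the words a plot lacks. A list shaped like feature_set is what gets passed to nltk.NaiveBayesClassifier.train, and the features for a new movie should be built with the same word_features before calling classify.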

answered 2013-05-11T12:19:59.383