python - Python Maxent classifier

Question

I've been using the maxent classifier in Python and it's failing, and I don't understand why.

I'm using the movie reviews corpus. (Total noob.)

import nltk.classify.util
from nltk.classify import MaxentClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
 return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

negcutoff = len(negfeats)*3//4  # integer division, so the result can be used as a slice index
poscutoff = len(posfeats)*3//4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = MaxentClassifier.train(trainfeats)

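Here is the error (I know I'm doing something wrong; please link to an explanation of how Maxent works):

Warning (from warnings module):
  File "C:\Python27\lib\site-packages\nltk\classify\maxent.py", line 1334
    sum1 = numpy.sum(exp_nf_delta * A, axis=0)
RuntimeWarning: invalid value encountered in multiply

Warning (from warnings module):
  File "C:\Python27\lib\site-packages\nltk\classify\maxent.py", line 1335
    sum2 = numpy.sum(nf_exp_nf_delta * A, axis=0)
RuntimeWarning: invalid value encountered in multiply

Warning (from warnings module):
  File "C:\Python27\lib\site-packages\nltk\classify\maxent.py", line 1341
    deltas -= (ffreq_empirical - sum1) / -sum2
RuntimeWarning: invalid value encountered in divide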

score 6 · Accepted Answer

I changed and updated the code.

import nltk
from nltk.classify import MaxentClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
 return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

negcutoff = len(negfeats)*3//4  # integer division, so the result can be used as a slice index
poscutoff = len(posfeats)*3//4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
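# Original call, which produced the warnings:
#classifier = nltk.MaxentClassifier.train(trainfeats)

# Pick the first algorithm in NLTK's list and cap training at 3 iterations.
algorithm = nltk.classify.MaxentClassifier.ALGORITHMS[0]
classifier = nltk.MaxentClassifier.train(trainfeats, algorithm, max_iter=3)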

classifier.show_most_informative_features(10)

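# Optional: restrict features to the 300 most frequent words.
# To take effect, this must be defined before negfeats/posfeats are built.
all_words = nltk.FreqDist(word for word in movie_reviews.words())
top_words = set(word for word, _ in all_words.most_common(300))

def word_feats(words):
    return {word: True for word in words if word in top_words}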
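The code above computes a 3/4 cutoff but only uses the training part. A minimal sketch of keeping the remaining quarter for evaluation is shown here; `split_train_test` is a hypothetical helper, and the toy feature lists are stand-ins for the real `negfeats` and `posfeats`.

```python
def split_train_test(feats, train_fraction=0.75):
    """Return (train, test) lists, keeping the first part for training."""
    cutoff = int(len(feats) * train_fraction)
    return feats[:cutoff], feats[cutoff:]

# Toy stand-ins for the (features, label) pairs built from movie_reviews.
negfeats = [({"bad": True}, "neg")] * 8
posfeats = [({"good": True}, "pos")] * 8

neg_train, neg_test = split_train_test(negfeats)
pos_train, pos_test = split_train_test(posfeats)

trainfeats = neg_train + pos_train
testfeats = neg_test + pos_test
print(len(trainfeats), len(testfeats))
```

With a trained classifier, `nltk.classify.util.accuracy(classifier, testfeats)` could then be used to score it on the held-out quarter.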

score 3 · Accepted Answer

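There may be a fix for the overflow problem in numpy, but since this is just a movie review classifier for learning NLTK/text classification (and you probably don't want training to take a long time anyway), I'll offer a simple workaround: you can just restrict the words used in the feature sets.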
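As a rough illustration of how an overflow can turn into "invalid value" warnings, here is a sketch using plain Python floats. It assumes, without confirmation from the traceback, that an exponential in the training loop overflowed to infinity and was then multiplied by a zero entry; NumPy follows the same IEEE 754 rules but reports the result as a RuntimeWarning.

```python
import math

# Stand-in for an exponential term that has already overflowed to infinity.
overflowed = math.inf

# inf * 0 has no defined value under IEEE 754, so the result is NaN.
product = overflowed * 0.0

# Once a NaN appears, it propagates through every later operation.
total = sum([1.0, product, 2.0])

print(math.isnan(product), math.isnan(total))
```

Fewer active features per review keep the exponent smaller, which is one reason the restriction suggested here may help.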

You can find the 300 most common words across all reviews like this (you can obviously raise that number if you'd like):

all_words = nltk.FreqDist(word for word in movie_reviews.words())
top_words = set(word for word, _ in all_words.most_common(300))

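Then all you have to do is check each word against top_words in the feature extractor for the reviews. Also, as a suggestion, it's more efficient to use a dict comprehension than to convert a list of tuples into a dict. So it might look like this: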

def word_feats(words):
    return {word:True for word in words if word in top_words}
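To try the idea without downloading the corpus, here is a small sketch using `collections.Counter`, which offers the same `most_common` method as NLTK 3's `FreqDist` (a `Counter` subclass). The toy `reviews` list and the cutoff of 2 are assumptions for illustration only.

```python
from collections import Counter

# Toy token lists standing in for movie_reviews.words(fileids=[f]).
reviews = [
    ["a", "great", "movie", "a", "movie"],
    ["a", "dull", "movie", "a"],
]

# Count every token across all reviews, then keep the N most frequent.
all_words = Counter(word for review in reviews for word in review)
top_words = {word for word, _count in all_words.most_common(2)}

def word_feats(words):
    # Keep only tokens that are in the restricted vocabulary.
    return {word: True for word in words if word in top_words}

print(word_feats(reviews[0]))
```

Using a set for `top_words` keeps each membership check constant-time, which matters when the extractor runs over every token of every review.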
