
I am trying to calculate the perplexity of some data I have. The code I am using is:

import sys
sys.path.append("/usr/local/anaconda/lib/python2.7/site-packages/nltk")

from nltk.corpus import brown
from nltk.model import NgramModel
from nltk.probability import LidstoneProbDist, WittenBellProbDist
estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
lm = NgramModel(3, brown.words(categories='news'), True, False, estimator)
print lm

But I get this error:

File "/usr/local/anaconda/lib/python2.7/site-packages/nltk/model/ngram.py", line 107, in __init__
cfd[context][token] += 1
TypeError: 'int' object has no attribute '__getitem__'

I have already run Latent Dirichlet Allocation on my data, and I have generated the unigrams and their respective probabilities (they are normalized so that the total probability over the data sums to 1).

My unigrams and their probabilities look like:

Negroponte 1.22948976891e-05
Andreas 7.11290670484e-07
Rheinberg 7.08255885794e-07
Joji 4.48481435106e-07
Helguson 1.89936727391e-07
CAPTION_spot 2.37395965468e-06
Mortimer 1.48540253778e-07
yellow 1.26582575863e-05
Sugar 1.49563800878e-06
four 0.000207196011781

This is just a fragment of the unigrams file I have; about 1000 more lines follow the same format. The probabilities in the second column sum to 1.
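
Such a file could be loaded into a dictionary with a sketch like the following (assuming two whitespace-separated columns per line; the name unigrams.txt is a placeholder):

# load "word probability" lines into a dict
unigram = {}
with open("unigrams.txt") as f:
    for line in f:
        parts = line.split()
        if len(parts) == 2:
            unigram[parts[0]] = float(parts[1])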

I am a fledgling programmer. This ngram.py belongs to the NLTK package, and I am confused about how to fix it. The sample code I have here is from the NLTK documentation, and I don't know what to do now. Please help me with what I can do. Thanks in advance!


2 Answers


Perplexity is the inverse probability of the test set, normalized by the number of words. In the case of unigrams:

$$\mathrm{PP}(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i)}}$$

Now, you say you have already built the unigram model, meaning that for each word you have its associated probability. Then you just need to apply the formula. I assume you have a big dictionary unigram[word] that gives the probability of each word in the corpus. You also need a test set. If your unigram model is not in the form of a dictionary, tell me what data structure you used, so I can adapt my solution accordingly.

perplexity = 1
N = 0

for word in testset:
    if word in unigram:
        # count only words the model knows about
        N += 1
        # multiply in the inverse probability of each word
        perplexity = perplexity * (1 / unigram[word])
# normalize by taking the N-th root
perplexity = pow(perplexity, 1 / float(N))
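
One caveat worth noting (my addition, not part of the original answer): for a long test set, the running product of 1/p terms can overflow a float. A mathematically equivalent sketch that accumulates log-probabilities instead:

import math

# perplexity = exp(-(1/N) * sum(log p(w))): the same formula computed in log space
log_prob_sum = 0.0
N = 0
for word in testset:
    if word in unigram:
        N += 1
        log_prob_sum += math.log(unigram[word])
perplexity = math.exp(-log_prob_sum / N)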

Update:

Since you asked for a complete working example, here is a very simple one.

Suppose this is our corpus:

corpus ="""
Monty Python (sometimes known as The Pythons) were a British surreal comedy group who created the sketch comedy show Monty Python's Flying Circus,
that first aired on the BBC on October 5, 1969. Forty-five episodes were made over four series. The Python phenomenon developed from the television series
into something larger in scope and impact, spawning touring stage shows, films, numerous albums, several books, and a stage musical.
The group's influence on comedy has been compared to The Beatles' influence on music."""

Here is how we first build the unigram model:

import collections, nltk

# we first tokenize the text corpus
tokens = nltk.word_tokenize(corpus)

# here you construct the unigram language model
def unigram(tokens):
    # the defaultdict returns a default probability of 0.01 for unseen words,
    # so the try/except fallback some versions of this snippet carry is unnecessary
    model = collections.defaultdict(lambda: 0.01)
    for f in tokens:
        model[f] += 1
    # normalize the raw counts of the seen words into probabilities
    N = float(sum(model.values()))
    for word in model:
        model[word] = model[word] / N
    return model
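
As a quick sanity check (my addition, not part of the original answer), the normalized probabilities of the seen words should sum to 1, while an unseen word falls back to the 0.01 default:

model = unigram(tokens)
print sum(model.values())  # ~1.0: the seen-word probabilities are normalized
print model['comedy']      # probability of a word that occurs in the corpus
print model['xyzzy']       # unseen word: returns the 0.01 default
# note: looking up an unseen key also inserts it into the defaultdict,
# so run the sum check before probing unseen words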

Our model here is smoothed: for words outside its vocabulary, it assigns a low default probability of 0.01. I have already shown you how to compute perplexity:

#computes perplexity of the unigram model on a testset  
def perplexity(testset, model):
    testset = testset.split()
    perplexity = 1
    N = 0
    for word in testset:
        N += 1
        perplexity = perplexity * (1/model[word])
    perplexity = pow(perplexity, 1/float(N)) 
    return perplexity

Now we can test it on two different test sets:

testset1 = "Monty"
testset2 = "abracadabra gobbledygook rubbish"

model = unigram(tokens)
print perplexity(testset1, model)
print perplexity(testset2, model)

You get the following results:

>>> 
49.09452736318415
99.99999999999997

Note that when dealing with perplexity, we try to minimize it: a language model with lower perplexity on a given test set is preferable to one with higher perplexity. In the first test set, the word Monty is included in the unigram model, so the corresponding perplexity is smaller.
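
To see where the second number comes from (my arithmetic, not part of the original answer): none of the three words in testset2 occur in the corpus, so each contributes the 0.01 default probability, and the perplexity is just the cube root of (1/0.01)^3:

print pow((1 / 0.01) ** 3, 1 / 3.0)  # 99.99999999999997, i.e. ~100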

Answered 2015-10-21T21:16:33.670

Thanks for the code snippet! Shouldn't:

for word in model:
    model[word] = model[word] / float(sum(model.values()))

rather be the following, so the normalizer is computed once, before any values change:

v = float(sum(model.values()))
for word in model:
    model[word] = model[word] / v

Oh... I see it has already been answered...

Answered 2018-01-17T16:10:49.320