python - NLTK: corpus-level BLEU vs sentence-level BLEU score

Question

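I have imported nltk in Python to calculate BLEU scores on Ubuntu. I understand how the sentence-level BLEU score works, but I don't understand how the corpus-level BLEU score works.

Below is my code for the corpus-level BLEU score: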

import nltk

hypothesis = ['This', 'is', 'cat'] 
reference = ['This', 'is', 'a', 'cat']
BLEUscore = nltk.translate.bleu_score.corpus_bleu([reference], [hypothesis], weights = [1])
print(BLEUscore)

For some reason, the BLEU score is 0 for the code above. I was expecting a corpus-level BLEU score of at least 0.5.

Here is my code for the sentence-level BLEU score:

import nltk

hypothesis = ['This', 'is', 'cat'] 
reference = ['This', 'is', 'a', 'cat']
BLEUscore = nltk.translate.bleu_score.sentence_bleu([reference], hypothesis, weights = [1])
print(BLEUscore)

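Here the sentence-level BLEU score is 0.71, which is what I expect given the brevity penalty and the missing word "a". However, I don't understand how the corpus-level BLEU score works.

Any help would be appreciated.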

score 33 · Accepted Answer

TL;DR:

>>> import nltk
>>> hypothesis = ['This', 'is', 'cat'] 
>>> reference = ['This', 'is', 'a', 'cat']
>>> references = [reference] # list of references for 1 sentence.
>>> list_of_references = [references] # list of references for all sentences in corpus.
>>> list_of_hypotheses = [hypothesis] # list of hypotheses that corresponds to list of references.
>>> nltk.translate.bleu_score.corpus_bleu(list_of_references, list_of_hypotheses)
0.6025286104785453
>>> nltk.translate.bleu_score.sentence_bleu(references, hypothesis)
0.6025286104785453

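(Note: you have to pull the latest version of NLTK from the develop branch to get a stable version of the BLEU score implementation.)

In long:

Actually, if there is only one reference and one hypothesis in your whole corpus, both corpus_bleu() and sentence_bleu() should return the same value, as shown in the example above.

In the code, we see that sentence_bleu is actually a duck-type of corpus_bleu, a thin wrapper that wraps its arguments in one more list and delegates to it: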

def sentence_bleu(references, hypothesis, weights=(0.25, 0.25, 0.25, 0.25),
                  smoothing_function=None):
    return corpus_bleu([references], [hypothesis], weights, smoothing_function)

And if we look at the parameters of sentence_bleu:

def sentence_bleu(references, hypothesis, weights=(0.25, 0.25, 0.25, 0.25),
                  smoothing_function=None):
    """
    :param references: reference sentences
    :type references: list(list(str))
    :param hypothesis: a hypothesis sentence
    :type hypothesis: list(str)
    :param weights: weights for unigrams, bigrams, trigrams and so on
    :type weights: list(float)
    :return: The sentence-level BLEU score.
    :rtype: float
    """

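The references input for sentence_bleu is a list(list(str)).

So if you have a sentence string, e.g. "This is a cat", you have to tokenize it to get a list of strings, ["This", "is", "a", "cat"]. Since multiple references are allowed, the input has to be a list of lists of strings. For example, if you have a second reference, "This is a feline", your input to sentence_bleu() would be: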

references = [ ["This", "is", "a", "cat"], ["This", "is", "a", "feline"] ]
hypothesis = ["This", "is", "cat"]
sentence_bleu(references, hypothesis)
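
To make the 0.71 from the question concrete, here is a minimal pure-Python sketch of a unigram-only BLEU for one hypothesis against several references: clipped unigram precision multiplied by the brevity penalty. The function name unigram_bleu is hypothetical and this is not NLTK's implementation; it assumes unigram-only weights and no smoothing.

```python
import math
from collections import Counter


def unigram_bleu(references, hypothesis):
    """Unigram-only BLEU sketch (hypothetical helper, not NLTK's code).

    references: list(list(str)), hypothesis: list(str)
    """
    if not hypothesis:
        return 0.0

    # Clip each hypothesis token count by its maximum count in any one reference.
    max_ref_counts = Counter()
    for ref in references:
        for token, count in Counter(ref).items():
            max_ref_counts[token] = max(max_ref_counts[token], count)
    clipped = sum(min(count, max_ref_counts[token])
                  for token, count in Counter(hypothesis).items())
    precision = clipped / len(hypothesis)

    # Brevity penalty uses the reference length closest to the hypothesis length
    # (ties go to the shorter reference).
    hyp_len = len(hypothesis)
    ref_len = min((len(ref) for ref in references),
                  key=lambda n: (abs(n - hyp_len), n))
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * precision


references = [["This", "is", "a", "cat"], ["This", "is", "a", "feline"]]
hypothesis = ["This", "is", "cat"]
score = unigram_bleu(references, hypothesis)
print(round(score, 4))  # 0.7165
```

All three hypothesis tokens appear in a reference, so the precision is 3/3, and the score comes entirely from the brevity penalty exp(1 - 4/3), in line with the roughly 0.71 reported in the question.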

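When it comes to the list_of_references parameter of corpus_bleu(), it is basically a list of whatever sentence_bleu() takes as references, one entry per sentence in the corpus: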

def corpus_bleu(list_of_references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25),
                smoothing_function=None):
    """
    :param references: a corpus of lists of reference sentences, w.r.t. hypotheses
    :type references: list(list(list(str)))
    :param hypotheses: a list of hypothesis sentences
    :type hypotheses: list(list(str))
    :param weights: weights for unigrams, bigrams, trigrams and so on
    :type weights: list(float)
    :return: The corpus-level BLEU score.
    :rtype: float
    """
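
The two type annotations differ by exactly one level of nesting, which is easy to get wrong. As a rough sketch, a hypothetical helper nesting_depth (not part of NLTK) shows that the question's [reference] is only a list(list(str)), one level short of what corpus_bleu() documents. With that shape, each token string would presumably be treated as a whole reference sentence, which is consistent with the score of 0 reported in the question.

```python
def nesting_depth(obj):
    """Count the list levels wrapping the innermost items.

    Hypothetical helper for illustration; it only inspects the first
    element at each level.
    """
    depth = 0
    while isinstance(obj, list):
        depth += 1
        if not obj:
            break
        obj = obj[0]
    return depth


hypothesis = ['This', 'is', 'cat']
reference = ['This', 'is', 'a', 'cat']

# What the question passed as list_of_references: list(list(str)).
print(nesting_depth([reference]))    # 2
# What corpus_bleu() documents: list(list(list(str))).
print(nesting_depth([[reference]]))  # 3
# Hypotheses for corpus_bleu(): list(list(str)).
print(nesting_depth([hypothesis]))   # 2
```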

Other than looking at the doctests within nltk/translate/bleu_score.py, you can also take a look at the unit tests at nltk/test/unit/translate/test_bleu_score.py to see how to use each component of bleu_score.py.

By the way, since sentence_bleu is imported as bleu in nltk/translate/__init__.py (https://github.com/nltk/nltk/blob/develop/nltk/translate/__init__.py#L21), using

from nltk.translate import bleu

would be the same as:

from nltk.translate.bleu_score import sentence_bleu

And in code:

>>> from nltk.translate import bleu
>>> from nltk.translate.bleu_score import sentence_bleu
>>> from nltk.translate.bleu_score import corpus_bleu
>>> bleu == sentence_bleu
True
>>> bleu == corpus_bleu
False

score 6 · Answer

Let's take a look:

>>> help(nltk.translate.bleu_score.corpus_bleu)
Help on function corpus_bleu in module nltk.translate.bleu_score:

corpus_bleu(list_of_references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=None)
    Calculate a single corpus-level BLEU score (aka. system-level BLEU) for all 
    the hypotheses and their respective references.  

    Instead of averaging the sentence level BLEU scores (i.e. marco-average 
    precision), the original BLEU metric (Papineni et al. 2002) accounts for 
    the micro-average precision (i.e. summing the numerators and denominators
    for each hypothesis-reference(s) pairs before the division).
    ...

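You're in a better position than I am to understand the description of the algorithm, so I won't try to "explain" it to you. If the docstring does not clear things up enough, take a look at the source itself. Or find it locally: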

>>> nltk.translate.bleu_score.__file__
'.../lib/python3.4/site-packages/nltk/translate/bleu_score.py'
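
The micro- versus macro-average distinction in the docstring can be illustrated with a small sketch. It uses clipped unigram precision only and ignores the brevity penalty; the helper clipped_unigram_counts and the toy corpus are made up for illustration and are not NLTK code.

```python
from collections import Counter


def clipped_unigram_counts(references, hypothesis):
    """Return (clipped matches, hypothesis length) for one sentence."""
    max_ref_counts = Counter()
    for ref in references:
        for token, count in Counter(ref).items():
            max_ref_counts[token] = max(max_ref_counts[token], count)
    matches = sum(min(count, max_ref_counts[token])
                  for token, count in Counter(hypothesis).items())
    return matches, len(hypothesis)


# Toy corpus: one long, perfect hypothesis and one short, half-wrong one.
list_of_references = [
    [["the", "cat", "sat", "on", "the", "mat"]],
    [["hello", "world"]],
]
hypotheses = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["goodbye", "world"],
]

pairs = [clipped_unigram_counts(refs, hyp)
         for refs, hyp in zip(list_of_references, hypotheses)]

# Macro-average: compute a precision per sentence, then average the precisions.
macro = sum(m / n for m, n in pairs) / len(pairs)
# Micro-average: sum numerators and denominators first, then divide once.
micro = sum(m for m, _ in pairs) / sum(n for _, n in pairs)

print(macro)  # 0.75
print(micro)  # 0.875
```

The two numbers differ because micro-averaging weights every token equally, so the longer sentence dominates, whereas macro-averaging weights every sentence equally. This is why a corpus-level score is generally not the mean of the sentence-level scores once the corpus has more than one sentence.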
