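I am trying to apply sentence_bleu to a column in Pandas to evaluate the quality of some machine translations, but the scores it outputs are incorrect. Can anyone see my mistake?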

import pandas as pd
from nltk.translate.bleu_score import sentence_bleu

translations = {
    'reference': [['this', 'is', 'a', 'test'],['this', 'is', 'a', 'test'],['this', 'is', 'a', 'test']],
    'candidate': [['this', 'is', 'a', 'test'],['this', 'is', 'not','a', 'quiz'],['I', 'like', 'kitties', '.']]
}
df = pd.DataFrame(translations)

df['BLEU'] = df.apply(lambda row: sentence_bleu(row['reference'],row['candidate']), axis=1)
df

It outputs this:

             reference                 candidate           BLEU
0  [this, is, a, test]       [this, is, a, test]  1.288230e-231
1  [this, is, a, test]  [this, is, not, a, quiz]  1.218332e-231
2  [this, is, a, test]     [I, like, kitties, .]   0.000000e+00

Row 0 should equal 1.0, and row 1 should be less than 1.0, probably around 0.9. What am I doing wrong?

1 Answer

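You are currently comparing the strings inside your lists. sentence_bleu takes a list of references as its first argument, so each word in row['reference'] is treated as a separate reference. Because each of those strings contains only a single word, every n-gram with n > 1 is scored as 0.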
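To make that concrete, here is a minimal sketch in plain Python. The `ngrams` helper is a hypothetical, simplified stand-in written for this illustration and is not NLTK's implementation. It only shows which n-grams can overlap when every word of the reference is treated as its own reference string, where iterating a string yields single characters.

```python
def ngrams(sequence, n):
    """Return all contiguous n-grams of a sequence as tuples (illustrative helper)."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]

# Data shaped like row 0 of the question: the flat word list is taken as the
# list of references, so every word is one "reference".
references = ['this', 'is', 'a', 'test']
candidate = ['this', 'is', 'a', 'test']

# Iterating a string yields characters, so these are character n-grams.
reference_unigrams = {gram for ref in references for gram in ngrams(ref, 1)}
reference_bigrams = {gram for ref in references for gram in ngrams(ref, 2)}

# The candidate is a list of words, so these are word n-grams.
matching_unigrams = [g for g in ngrams(candidate, 1) if g in reference_unigrams]
matching_bigrams = [g for g in ngrams(candidate, 2) if g in reference_bigrams]

print(matching_unigrams)  # only the one-letter word 'a' can match a character
print(matching_bigrams)   # no word bigram can match a character bigram
```

Only the one-letter word "a" overlaps at the unigram level and nothing overlaps for longer n-grams, which is consistent with the near-zero scores in the question.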

Instead, you want your reference to be ['this is a test'] (a list of ground-truth references) and your candidate to be 'this is a test' (a single candidate).

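import pandas as pd
from nltk.translate.bleu_score import sentence_bleu

# Each reference cell is a list of reference sentences;
# each candidate cell is a single sentence.
translations = {
    'reference': [['this is a test'],['this is a test'],['this is a test']],
    'candidate': ['this is a test','this is not a test','I like kitties']
}
df = pd.DataFrame(translations)

# Score each row: the first argument is the list of references, the second is the candidate.
df['BLEU'] = df.apply(lambda row: sentence_bleu(row['reference'],row['candidate']), axis=1)
df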

The result is:

          reference           candidate           BLEU
0  [this is a test]      this is a test   1.000000e+00
1  [this is a test]  this is not a test   7.037906e-01
2  [this is a test]      I like kitties  6.830097e-155
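One caveat: when plain strings are passed, NLTK iterates over them character by character, so the scores above are computed over character n-grams. If word-level BLEU is wanted, a minimal sketch (assuming the sentences are already tokenized, as in the question) is to keep the token lists and wrap the single reference in an outer list:

```python
from nltk.translate.bleu_score import sentence_bleu

# One tokenized reference and the three tokenized candidates from the question.
reference = ['this', 'is', 'a', 'test']
candidates = [
    ['this', 'is', 'a', 'test'],
    ['this', 'is', 'not', 'a', 'quiz'],
    ['I', 'like', 'kitties', '.'],
]

# sentence_bleu expects a list of references, each a list of tokens,
# so the single reference is wrapped as [reference].
scores = [sentence_bleu([reference], candidate) for candidate in candidates]

for candidate, score in zip(candidates, scores):
    print(candidate, score)
```

With the question's DataFrame, the equivalent call would be df.apply(lambda row: sentence_bleu([row['reference']], row['candidate']), axis=1). For sentences this short, any candidate with no matching 3-grams or 4-grams still scores close to zero under the default weights, and NLTK emits a warning. The SmoothingFunction class in nltk.translate.bleu_score is one option to look at in that case.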
Answered 2019-06-15T12:12:26.447