machine-learning - What is the difference between mteval-v13a.pl and NLTK BLEU?

Question

There is an implementation of BLEU score in Python NLTK, nltk.translate.bleu_score.corpus_bleu

But I am not sure if it is the same as the mtevalv13a.pl script.

What is the difference between them?

score 8 · Accepted Answer

TL;博士

在评估机器翻译系统时使用https://github.com/mjpost/sacrebleu 。

简而言之

不，NLTK 中的 BLEU 与mteval-13a.perl.

nltk.translate.corpus_bleu对应于mteval-13a.plngram 的 4 阶，具有一些浮点差异

import nltk
nltk.download('wmt15_eval')

主要区别：

在长

mteval-13a.pl和之间有几个区别nltk.translate.corpus_bleu：

第一个区别是mteval-13a.pl它带有自己的 NIST 标记器，而 BLEU 的 NLTK 版本是度量的实现，并假设输入是预先标记的。
- 顺便说一句，这个正在进行的 PR将弥合 NLTK 和 NIST 标记器之间的差距
另一个主要区别是，当 NLTK BLEU 接收字符串列表的 python 列表时，mteval-13a.pl期望输入是格式的，有关如何将 textfile 转换为 SGM 的更多信息，请参阅 zipball 中的 README.txt。.sgm
mteval-13a.pl期望至少 1-4 的 ngram 顺序。如果句子/语料库的最小 ngram 顺序小于 4，它将返回 0 概率，即 a math.log(float('-inf'))。为了模拟这种行为，NLTK 有一个 put_emulate_multibleu标志：
- 见https://github.com/nltk/nltk/blob/develop/nltk/translate/bleu_score.py#L477
mteval-13a.pl能够生成 NIST 分数，而 NLTK 没有 NIST 分数实施（至少现在还没有）
- NLTK 中的 NIST 分数即将在此 PR中发布

除了差异之外，NLTK BLEU 分数还包含更多功能：

处理原始 BLEU (Papineni, ‎2002) 忽略的边缘案例
- 见https://github.com/nltk/nltk/pull/1383
此外，为了处理 Ngram 的最大阶数 < 4 的边缘情况，单个 ngram 精度的统一权重将被重新加权，使得权重的质量总和为 1.0
- 见https://github.com/nltk/nltk/blob/develop/nltk/translate/bleu_score.py#L175
虽然NIST 有一种用于几何序列平滑的平滑方法，但NLTK 有一个具有相同平滑方法甚至更多平滑方法的等效对象来处理来自 Chen 和 Collin，2014的句子级 BLEU

最后，为了验证 NLTK 版本的 BLEU 中添加的功能，为它们添加了回归测试，请参阅https://github.com/nltk/nltk/blob/develop/nltk/test/unit/translate/test_bleu.py

1 回答 1