machine-learning - How to get alignment or attention information for translations produced by a Torch Hub model?

Question

Torch Hub provides pretrained models, for example: https://pytorch.org/hub/pytorch_fairseq_translation/

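These models can be used from Python or interactively through a CLI. With the CLI, alignments can be obtained with the --print-alignment flag. After installing fairseq (and PyTorch), the following works in a terminal: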

curl https://dl.fbaipublicfiles.com/fairseq/models/wmt14.v2.en-fr.fconv-py.tar.bz2 | tar xvjf -
MODEL_DIR=wmt14.en-fr.fconv-py
fairseq-interactive \
    --path $MODEL_DIR/model.pt $MODEL_DIR \
    --beam 5 --source-lang en --target-lang fr \
    --tokenizer moses \
    --bpe subword_nmt --bpe-codes $MODEL_DIR/bpecodes \
    --print-alignment
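
If the CLI output is captured (for example, redirected to a file), the alignment lines can be pulled out with a small parser. This is a minimal sketch that assumes each alignment line has the form A-<sentence id>, a tab, then space-separated <source index>-<target index> pairs. That layout is an assumption; it differs between fairseq versions, so check it against your own output first. The function name parse_alignment_lines is made up for illustration.

```python
import re
from typing import Dict, List, Tuple

# Assumed line layout (not guaranteed for every fairseq version):
#   A-<sentence id>\t<src>-<tgt> <src>-<tgt> ...
_ALIGN_LINE = re.compile(r"^A-(\d+)\t(.*)$")


def parse_alignment_lines(output: str) -> Dict[int, List[Tuple[int, int]]]:
    """Map each sentence id to its list of (source, target) index pairs."""
    alignments: Dict[int, List[Tuple[int, int]]] = {}
    for line in output.splitlines():
        match = _ALIGN_LINE.match(line)
        if match is None:
            continue  # not an alignment line (source, hypothesis, scores, ...)
        pairs: List[Tuple[int, int]] = []
        for item in match.group(2).split():
            src, sep, tgt = item.partition("-")
            # Skip anything that is not a plain "<int>-<int>" pair.
            if sep and src.isdigit() and tgt.isdigit():
                pairs.append((int(src), int(tgt)))
        alignments[int(match.group(1))] = pairs
    return alignments
```

The indices refer to token positions after tokenization and BPE, so they have to be mapped back to words separately if word-level alignments are needed.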

In Python, the keyword arguments verbose and print_alignment can be specified:

import torch

en2fr = torch.hub.load('pytorch/fairseq', 'transformer.wmt14.en-fr', tokenizer='moses', bpe='subword_nmt')

fr = en2fr.translate('Hello world!', beam=5, verbose=True, print_alignment=True)

However, this only writes the alignment out as a log message. With fairseq 0.9 it also appears to be broken and results in an error message (issue).

Is there a way to access the alignment information (or possibly even the full attention matrix) from Python code?

score 2 · Accepted Answer

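I browsed the fairseq codebase and found a hacky way to output the alignment information. Since it requires editing the fairseq source code itself, I don't consider it an acceptable solution. But maybe it helps someone (I am still very interested in an answer on how to do this properly).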

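Edit the sample() function and rewrite its return statement. Here is the whole function (to help you find it in the code), but only the last line should be changed: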

def sample(self, sentences: List[str], beam: int = 1, verbose: bool = False, **kwargs) -> List[str]:
    if isinstance(sentences, str):
        return self.sample([sentences], beam=beam, verbose=verbose, **kwargs)[0]
    tokenized_sentences = [self.encode(sentence) for sentence in sentences]
    batched_hypos = self.generate(tokenized_sentences, beam, verbose, **kwargs)
    return list(zip([self.decode(hypos[0]['tokens']) for hypos in batched_hypos], [hypos[0]['alignment'] for hypos in batched_hypos]))
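
The same steps can probably be run from outside the class, which avoids patching the installed package. The sketch below only mirrors the body of sample() shown above: it assumes the hub model exposes encode(), generate() and decode() as in that function, and that each hypothesis is a dict with 'tokens' and 'alignment' entries. The helper name translate_with_alignment is made up for illustration, and whether 'alignment' is actually filled in may depend on the fairseq version and on passing print_alignment=True.

```python
from typing import Any, List, Optional, Tuple


def translate_with_alignment(
    model: Any, sentences: List[str], beam: int = 5, **kwargs: Any
) -> List[Tuple[str, Optional[Any]]]:
    """Return (translation, alignment) for the best hypothesis of each sentence.

    `model` is assumed to offer the encode/generate/decode methods that
    sample() uses in the snippet above; this is not an official fairseq API.
    """
    tokenized = [model.encode(sentence) for sentence in sentences]
    batched_hypos = model.generate(tokenized, beam=beam, **kwargs)
    results: List[Tuple[str, Optional[Any]]] = []
    for hypos in batched_hypos:
        best = hypos[0]  # hypotheses are taken best-first, as in sample()
        # .get() yields None if this version does not provide an alignment.
        results.append((model.decode(best["tokens"]), best.get("alignment")))
    return results
```

With the model from the question, a call would look like translate_with_alignment(en2fr, ['Hello world!'], beam=5, print_alignment=True). In some fairseq versions the hypothesis dict also appears to carry an 'attention' entry holding the attention weights; treat that as something to check in your installed version, not a guarantee.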
