python - How to untokenize BERT tokens?

Question

I have a sentence and I need to return the text corresponding to N BERT tokens to the left and right of a specific word.

from transformers import BertTokenizer
tz = BertTokenizer.from_pretrained("bert-base-cased")
sentence = "The Natural Science Museum of Madrid shows the RECONSTRUCTION of a dinosaur"

tokens = tz.tokenize(sentence)
print(tokens)

>>['The', 'Natural', 'Science', 'Museum', 'of', 'Madrid', 'shows', 'the', 'R', '##EC', '##ON', '##ST', '##R', '##UC', '##TI', '##ON', 'of', 'a', 'dinosaur']

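What I want is to get the text corresponding to 4 tokens to the left and right of the token Madrid. So I want the tokens ['Natural', 'Science', 'Museum', 'of', 'Madrid', 'shows', 'the', 'R', '##EC'] and then to convert them back into the original text. In this case it would be 'Natural Science Museum of Madrid shows the REC'.

Is there a way to do this?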

score 2 · Accepted Answer

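In addition to what Jindrich said about information loss, I want to add that huggingface provides a built-in method to convert tokens to a string (the lost information stays lost!). The method is called convert_tokens_to_string: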

tz.convert_tokens_to_string(tokens[1:10])

Output:

'Natural Science Museum of Madrid shows the REC'
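The slice bounds 1 and 10 above are hard-coded for this sentence. As a minimal sketch, they could be computed from the position of the target token instead. The helper name token_window is hypothetical (not part of transformers), and it assumes the target word is a single token in the list; a word split into several word pieces would need sequence matching instead.

```python
def token_window(tokens, target, n):
    """Return up to n tokens on each side of the first occurrence of target."""
    center = tokens.index(target)  # raises ValueError if target is absent
    start = max(0, center - n)     # clamp so the slice never wraps around
    return tokens[start:center + n + 1]


# Token list copied from the question's output.
tokens = ['The', 'Natural', 'Science', 'Museum', 'of', 'Madrid', 'shows', 'the',
          'R', '##EC', '##ON', '##ST', '##R', '##UC', '##TI', '##ON',
          'of', 'a', 'dinosaur']

window = token_window(tokens, "Madrid", 4)
print(window)
# ['Natural', 'Science', 'Museum', 'of', 'Madrid', 'shows', 'the', 'R', '##EC']
```

The resulting list can then be passed to tz.convert_tokens_to_string(window) in place of tokens[1:10].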

score 2 · Accepted Answer

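BERT uses word-piece tokenization, which unfortunately is not lossless, i.e., you are never guaranteed to get the same sentence back after detokenization. This is a big difference from RoBERTa, whose byte-level BPE tokenizer is fully reversible.

You can get the so-called pre-tokenized text by merging the tokens that start with ## into the preceding token.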

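# Merge word pieces: a token starting with "##" continues the previous word.
pretok_sent = ""
for tok in tokens:
    if tok.startswith("##"):
        pretok_sent += tok[2:]
    else:
        pretok_sent += " " + tok
pretok_sent = pretok_sent[1:]  # drop the leading space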

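This snippet reconstructs the sentence in your example, but note that if the sentence contains punctuation, the punctuation stays separated from the other tokens; this is the pre-tokenization. The sentence can look like this: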

'This is a sentence ( with brackets ) .'
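To make the pre-tokenized form concrete, here is a small sketch of the same merging logic wrapped in a function. The name merge_wordpieces is hypothetical, and the token list is hand-written for illustration (the real bert-base-cased tokenizer may split these words differently). Unlike the loop above, it also strips the ## prefix when a window starts in the middle of a word.

```python
def merge_wordpieces(tokens):
    """Join WordPiece tokens into space-separated pre-tokenized text."""
    words = []
    for tok in tokens:
        if tok.startswith("##"):
            piece = tok[2:]
            if words:
                words[-1] += piece   # continue the previous word
            else:
                words.append(piece)  # window began mid-word
        else:
            words.append(tok)
    return " ".join(words)


# Hand-written tokens, assumed for illustration only.
example_tokens = ['This', 'is', 'a', 'sentence', '(', 'with',
                  'bra', '##cket', '##s', ')', '.']
pretok = merge_wordpieces(example_tokens)
print(pretok)
# This is a sentence ( with brackets ) .
```

The brackets and the final period remain separate, space-delimited items, which is exactly the pre-tokenized form described above.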

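Going from the pre-tokenized text to a standard sentence is the lossy step (you can never know whether, and how many, extra spaces were in the original sentence). You can get a standard sentence by applying detokenization rules, for example those in sacremoses.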

import sacremoses
sent = "This is a sentence ( with brackets ) ."
detok = sacremoses.MosesDetokenizer(lang="en")
detok.detokenize(sent.split(" "))

This results in:

'This is a sentence (with brackets).'
