Let's consider these two sentences:

"why isn't Alex's text tokenizing? The house on the left is the Smiths' house"

Now let's tokenize and decode:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
# tokenize, map the tokens to ids, then decode back to a string
tokenizer.decode(tokenizer.convert_tokens_to_ids(tokenizer.tokenize("why isn't Alex's text tokenizing? The house on the left is the Smiths' house")))

We get:

"why isn't alex's text tokenizing? the house on the left is the smiths'house"

My question is: how do I deal with the missing space in possessives such as smiths'house?
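
One observation (a minimal sketch, based on my reading of the library, not a definitive explanation): the space seems to be dropped not by the tokenizer itself but by the cleanup step inside decode(), which glues a standalone " ' " back onto the preceding word. Passing clean_up_tokenization_spaces=False keeps that space, at the cost of extra spaces elsewhere, so it is only a partial fix:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
text = "why isn't Alex's text tokenizing? The house on the left is the Smiths' house"
ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))
# skip the post-processing that collapses " ' " onto the previous word
print(tokenizer.decode(ids, clean_up_tokenization_spaces=False))
# roughly: "why isn ' t alex ' s text tokenizing ? ... is the smiths ' house"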

It seems to me that the tokenization process in Transformers is not doing the right thing here. Let's look at the output of

tokenizer.tokenize("why isn't Alex's text tokenizing? The house on the left is the Smiths' house")

We get:

['why', 'isn', "'", 't', 'alex', "'", 's', 'text', 'token', '##izing', '?', 'the', 'house', 'on', 'the', 'left', 'is', 'the', 'smith', '##s', "'", 'house']

So already at this step we have lost important information about the last apostrophe. It would be better if the tokenization were done differently:

['why', 'isn', "##'", '##t', 'alex', "##'", '##s', 'text', 'token', '##izing', '?', 'the', 'house', 'on', 'the', 'left', 'is', 'the', 'smith', '##s', "##'", 'house']

This way the tokenization would keep all the information about the apostrophes, and we would not run into the problem with possessives.
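
If the goal is simply not to lose the original apostrophes and spacing, one workaround (a sketch under my own assumptions, not necessarily the intended solution) is to use the fast tokenizer and keep the character offsets, so every token can be mapped back to the exact span of the untouched input:

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
text = "why isn't Alex's text tokenizing? The house on the left is the Smiths' house"
# offset_mapping gives (start, end) character positions into the original string
enc = tokenizer(text, return_offsets_mapping=True, add_special_tokens=False)

for tok, (start, end) in zip(enc.tokens(), enc['offset_mapping']):
    print(tok, repr(text[start:end]))

The tokens are still lowercased and split on the apostrophe, but the offsets point into the original cased, correctly spaced text, so "Smiths' house" can always be recovered from them.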
