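I am trying to use DeBERTa for an NER classification task, but I am running into a tokenizer error. Here is my code (my input sentence has to be split word by word, on ",:"):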
import transformers
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)
tokenizer(["Hello", ",", "this", "is", "one", "sentence", "split", "into", "words", "."])
I get this result:
{'input_ids': [[1, 31414, 2], [1, 6, 2], [1, 9226, 2], [1, 354, 2], [1, 1264, 2], [1, 19530, 4086, 2], [1, 44154, 2], [1, 12473, 2], [1, 30938, 2], [1, 4, 2]], 'token_type_ids': [[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0]], 'attention_mask': [[1, 1, 1], [1, 1, 1], [1, 1, 1], [1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1], [1, 1, 1], [1, 1, 1]]}
Then I continue, but I get this error:
tokenized_input = tokenizer(example["tokens"])
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
print(tokens)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'list'
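I think the reason is that I need the tokenizer output in the following format (which is not possible, because my sentence is already split on ","):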
tokenizer("Hello, this is one sentence!")
{'input_ids': [1, 31414, 6, 42, 16, 65, 3645, 328, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
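The difference between the two outputs can be checked without loading a model. The sketch below copies the ids from the outputs above (the first one shortened to three words) and uses a hypothetical helper, `is_flat_id_list`, that is not part of transformers. It assumes that `convert_ids_to_tokens` converts each element with `int()`, which would explain the TypeError when each element is itself a list.

```python
# Ids copied from the outputs above; the per-word result is shortened to three words.
per_word_ids = [[1, 31414, 2], [1, 6, 2], [1, 9226, 2]]    # list of words treated as a batch
sentence_ids = [1, 31414, 6, 42, 16, 65, 3645, 328, 2]     # single string input


def is_flat_id_list(ids):
    """Hypothetical helper: True when ids is a flat list of ints."""
    return all(isinstance(i, int) for i in ids)


def first_bad_element(ids):
    """Return the first element that int() rejects, or None if all convert."""
    for element in ids:
        try:
            int(element)  # assumption: mirrors the per-element conversion in the tokenizer
        except TypeError:
            return element
    return None


print(is_flat_id_list(sentence_ids))    # flat list: one encoding for one sentence
print(is_flat_id_list(per_word_ids))    # nested list: one encoding per word
print(first_bad_element(per_word_ids))  # the inner list that int() cannot convert
```

Read this way, the list of words was tokenized as ten separate one-word sentences, each with its own start and end ids, rather than as one sentence.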
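So I tried the following two approaches, but I am worn out and do not know how to proceed. There is very little documentation online about DeBERTa.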
tokenizer(["Hello", ",", "this", "is", "one", "sentence", "split", "into", "words", "."], is_split_into_words=True)
AssertionError: You need to instantiate DebertaTokenizerFast with add_prefix_space=True to use it with pretokenized inputs.
tokenizer(["Hello", ",", "this", "is", "one", "sentence", "split", "into", "words", "."], is_split_into_words=True,add_prefix_space=True)
And the error is still the same. Thank you very much!
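The assertion message talks about instantiating the tokenizer with `add_prefix_space=True`, so one thing that may be worth trying is passing the flag to `AutoTokenizer.from_pretrained` instead of to the call itself. The following is a minimal sketch under that assumption, not a confirmed fix for every checkpoint. The function names `load_tokenizer_for_split_words` and `tokenize_words` are made up for illustration, and `model_checkpoint` is whatever checkpoint the code above already uses.

```python
def load_tokenizer_for_split_words(checkpoint):
    """Load a fast tokenizer configured for pre-split (word-level) input."""
    # Imported here so the sketch can be loaded even without transformers installed.
    from transformers import AutoTokenizer

    # Assumption: the flag has to be set at instantiation, as the assertion message suggests.
    return AutoTokenizer.from_pretrained(checkpoint, add_prefix_space=True)


def tokenize_words(tokenizer, words):
    """Tokenize one sentence given as a list of words and return (encoding, tokens)."""
    encoding = tokenizer(words, is_split_into_words=True)
    # With is_split_into_words=True, input_ids should be one flat list for the sentence.
    tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"])
    return encoding, tokens


# Usage, with model_checkpoint defined as in the code above:
# tokenizer = load_tokenizer_for_split_words(model_checkpoint)
# encoding, tokens = tokenize_words(tokenizer, ["Hello", ",", "this", "is", "one", "sentence"])
# print(tokens)
```

If this works, `input_ids` should come back as a single flat list for the sentence, which is the shape `convert_ids_to_tokens` accepts.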