I am trying to use SentencePiece to build my own tokenizer from my own dataset/vocabulary, and then use it with transformers' AlbertTokenizer.
I followed the HuggingFace tutorial on how to train a model from scratch very closely: https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb#scrollTo=hO5M3vrAhcuj
# import relevant libraries
from pathlib import Path
from tokenizers import SentencePieceBPETokenizer
from tokenizers.processors import BertProcessing
from transformers import AlbertTokenizer

paths = [str(x) for x in Path("./data").glob("**/*.txt")]

# Initialize a tokenizer
tokenizer = SentencePieceBPETokenizer(add_prefix_space=True)

# Customize training
tokenizer.train(files=paths,
                vocab_size=32000,
                min_frequency=2,
                show_progress=True,
                special_tokens=['<unk>'],)

# Saving model
tokenizer.save_model("Sent-AlBERT")

# Reload the trained tokenizer from the saved files
tokenizer = SentencePieceBPETokenizer(
    "./Sent-AlBERT/vocab.json",
    "./Sent-AlBERT/merges.txt",
)
tokenizer.enable_truncation(max_length=512)
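For reference, save_model only writes the BPE vocabulary and merges files into that directory (the same two files loaded back above); as far as I can tell there is no spiece.model in it. A quick check, just for illustration:

import os
# The directory written by save_model should only contain the BPE files
# ('vocab.json' and 'merges.txt'), not a SentencePiece 'spiece.model'.
print(os.listdir("./Sent-AlBERT"))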
Everything works fine until I try to re-create the tokenizer in transformers:
# Re-create our tokenizer in transformers
tokenizer = AlbertTokenizer.from_pretrained("./Sent-AlBERT", do_lower_case=True)
This is the error message I keep getting:
OSError: Model name './Sent-AlBERT' was not found in tokenizers model name list (albert-base-v1, albert-large-v1, albert-xlarge-v1, albert-xxlarge-v1, albert-base-v2, albert-large-v2, albert-xlarge-v2, albert-xxlarge-v2). We assumed './Sent-AlBERT' was a path, a model identifier, or url to a directory containing vocabulary files named ['spiece.model'] but couldn't find such vocabulary files at this path or url.
For some reason, it works with RobertaTokenizerFast, but not with AlbertTokenizer.
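(For reference, this is roughly the loading code that does work for me, following the same tutorial; max_len=512 is just the value used in the notebook:)

from transformers import RobertaTokenizerFast
# Loading the same directory with the Roberta fast tokenizer works,
# since it reads the vocab.json / merges.txt pair written above.
tokenizer = RobertaTokenizerFast.from_pretrained("./Sent-AlBERT", max_len=512)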
I would be grateful if anyone could give me a suggestion or any pointers on how to use SentencePiece with AlbertTokenizer.
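From the error message, my understanding is that AlbertTokenizer looks for an actual SentencePiece model file (spiece.model) rather than the vocab.json / merges.txt pair produced above. Below is a rough, untested sketch of what I guess that would involve, using the standalone sentencepiece package (the "Sent-AlBERT/spiece" model_prefix is just my assumption, chosen to match the filename the error mentions). Is this the right direction, or is there a way to reuse the tokenizer I already trained above?

import sentencepiece as spm

# Train a real SentencePiece model directly on the raw text files;
# this writes spiece.model / spiece.vocab instead of vocab.json + merges.txt.
spm.SentencePieceTrainer.train(
    input=",".join(paths),              # comma-separated list of input .txt files
    model_prefix="Sent-AlBERT/spiece",  # so the output file is named spiece.model
    vocab_size=32000,
)

# AlbertTokenizer should then find 'spiece.model' in that directory
tokenizer = AlbertTokenizer.from_pretrained("./Sent-AlBERT", do_lower_case=True)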