I want to pretrain a T5 model using Hugging Face. The first step is to train the tokenizer with the following code:

import datasets

# SentencePieceUnigramTokenizer is the helper class from t5_tokenizer_model.py,
# which ships alongside the transformers Flax T5 pretraining example
from t5_tokenizer_model import SentencePieceUnigramTokenizer


vocab_size = 32_000
input_sentence_size = None

# Load the Persian subset of OSCAR; the download and Arrow conversion happen here
dataset = datasets.load_dataset("oscar", name="unshuffled_deduplicated_fa", split="train")

tokenizer = SentencePieceUnigramTokenizer(unk_token="<unk>", eos_token="</s>", pad_token="<pad>")


# Build an iterator over this dataset
def batch_iterator(input_sentence_size=None):
    if input_sentence_size is None:
        input_sentence_size = len(dataset)
    batch_length = 100
    for i in range(0, input_sentence_size, batch_length):
        yield dataset[i: i + batch_length]["text"]


# Train tokenizer
tokenizer.train_from_iterator(
    iterator=batch_iterator(input_sentence_size=input_sentence_size),
    vocab_size=vocab_size,
    show_progress=True,
)

# Save files to disk
tokenizer.save("./persian-t5-base/tokenizer.json")
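
For reference, once training finishes I plan to load the saved tokenizer.json back into transformers roughly like this (a sketch of my own; the PreTrainedTokenizerFast wrapper and the re-declared special tokens are my assumption, not part of the example script):

from transformers import PreTrainedTokenizerFast

# Wrap the trained tokenizer.json so it can be used like any transformers tokenizer.
# Special tokens are passed again explicitly, assuming the wrapper needs them declared.
tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="./persian-t5-base/tokenizer.json",
    unk_token="<unk>",
    eos_token="</s>",
    pad_token="<pad>",
)
print(tokenizer.tokenize("سلام دنیا"))  # quick sanity check on Persian text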

For the download step, the message is:

Downloading and preparing dataset oscar/unshuffled_deduplicated_fa (download: 9.74 GiB, generated: 37.24 GiB, post-processed: Unknown size, total: 46.98 GiB) to /root/.cache/huggingface/datasets/oscar/unshuffled_deduplicated_fa/1.0.0/...

I am running this on Google Colab Pro (with the High-RAM setting and a TPU). However, after about 2 hours the cell is still executing the load_dataset call, and I can't tell what it is doing.
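
To check whether anything is actually happening, I am watching the datasets cache directory grow with a small snippet (the path is taken from the log message above; adjust it if HF_DATASETS_CACHE points elsewhere):

import os

# Sum the size of everything under the datasets cache; if this number keeps
# growing between runs, the download/Arrow generation is still making progress.
cache_dir = "/root/.cache/huggingface/datasets"
total_bytes = 0
for root, _, files in os.walk(cache_dir):
    for name in files:
        path = os.path.join(root, name)
        if os.path.exists(path):  # temp files may vanish mid-walk
            total_bytes += os.path.getsize(path)
print(f"Cache size so far: {total_bytes / 1024**3:.2f} GiB")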

Is it normal for load_dataset to take this much time? Should I interrupt it and run it again?
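
If the full 47 GiB really is unavoidable, would streaming be a reasonable workaround? A minimal sketch of what I have in mind, assuming the installed datasets version supports streaming for OSCAR (examples are then fetched lazily instead of being materialized on disk):

import datasets

# streaming=True returns an IterableDataset: no full download and no Arrow
# conversion; examples are yielded on the fly as the iterator is consumed.
stream = datasets.load_dataset(
    "oscar", name="unshuffled_deduplicated_fa", split="train", streaming=True
)

def streaming_batch_iterator(batch_length=100):
    batch = []
    for example in stream:
        batch.append(example["text"])
        if len(batch) == batch_length:
            yield batch
            batch = []
    if batch:
        yield batch

As far as I can tell, tokenizer.train_from_iterator should accept this iterator unchanged, though I have not measured how streaming throughput compares to training from a local copy.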
