
I have a 16 GB corpus and about 16 GB of RAM. If I load the entire dataset to train a RoBERTa language model from scratch, I will run into memory problems. I plan to train my RoBERTa with the script provided in Huggingface's tutorial from their blog post: https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb

However, their blog post suggests using LineByLineTextDataset, which loads the dataset eagerly.

import logging
import os

import torch
from torch.utils.data import Dataset
from transformers import PreTrainedTokenizer

logger = logging.getLogger(__name__)


class LineByLineTextDataset(Dataset):
    """
    This will be superseded by a framework-agnostic approach
    soon.
    """

    def __init__(self, tokenizer: PreTrainedTokenizer, file_path: str, block_size: int):
        assert os.path.isfile(file_path)
        # Here, we do not cache the features, operating under the assumption
        # that we will soon use fast multithreaded tokenizers from the
        # `tokenizers` repo everywhere =)
        logger.info("Creating features from dataset file at %s", file_path)

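        # Note: f.read() below pulls the entire file into memory at once,
        # which is what exhausts RAM when the corpus is as large as available memory.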
        with open(file_path, encoding="utf-8") as f:
            lines = [line for line in f.read().splitlines() if (len(line) > 0 and not line.isspace())]

        batch_encoding = tokenizer(lines, add_special_tokens=True, truncation=True, max_length=block_size)
        self.examples = batch_encoding["input_ids"]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i) -> torch.Tensor:
        return torch.tensor(self.examples[i], dtype=torch.long)

As expected, my kernel crashed at the part where the lines are read. I would like to know whether there is a way to make it read lazily. An answer that requires only minimal code changes to the posted tutorial would be highly preferable, since I am fairly new to Huggingface and worry that I would not be able to debug it myself.


1 Answer


I would recommend using HuggingFace's own datasets library. The documentation says:

It provides a very efficient way to load and process data from raw files (CSV/JSON/text) or in-memory data (Python dict, pandas DataFrame), with a special focus on memory efficiency and speed. As an example, loading an 18 GB dataset like English Wikipedia allocates only 9 MB in RAM, and you can iterate over the dataset at 1-2 GBit/s in Python.
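For example, here is a minimal sketch of loading a large plain-text corpus with the datasets library; the file path "corpus.txt" and the variable names are placeholders, not from the tutorial:

from datasets import load_dataset

# The "text" loading script writes the file into an Arrow table on disk and
# memory-maps it, so the full 16 GB corpus is never held in RAM at once.
raw_dataset = load_dataset("text", data_files={"train": "corpus.txt"}, split="train")

# Mirror LineByLineTextDataset's behaviour of dropping empty/whitespace-only lines.
raw_dataset = raw_dataset.filter(
    lambda example: len(example["text"]) > 0 and not example["text"].isspace()
)

print(raw_dataset)  # Dataset({features: ['text'], num_rows: ...})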

The quick tour has good explanations and code snippets for creating a dataset object with your own data and it also explains how to train your own model.
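To keep the changes to the posted notebook minimal, one possible approach (a sketch, assuming tokenizer, model, and training_args are the objects already built in the tutorial) is to tokenize the loaded dataset with .map() and hand it to the Trainer in place of LineByLineTextDataset:

from transformers import DataCollatorForLanguageModeling, Trainer

def tokenize_function(examples):
    # Batched tokenization; the result is cached on disk rather than kept in RAM.
    return tokenizer(examples["text"], add_special_tokens=True, truncation=True, max_length=512)

tokenized_dataset = raw_dataset.map(tokenize_function, batched=True, remove_columns=["text"])

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,            # the RoBERTa model from the tutorial
    args=training_args,     # the TrainingArguments from the tutorial
    data_collator=data_collator,
    train_dataset=tokenized_dataset,
)
trainer.train()

Because both the raw and the tokenized Arrow files are memory-mapped from disk, the full corpus never needs to fit in RAM.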

answered 2020-12-06T10:54:02.267