pytorch - Is there a way to use the GPU instead of the CPU for BERT tokenization?

Question

I am using a BERT tokenizer on a large dataset of sentences (2.3 million rows, 6.53 billion words):

from transformers import BertTokenizer

# Create a BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased',
                                          do_lower_case=True)

#encoding the data using our tokenizer
encoded_dict = tokenizer.batch_encode_plus(
    df[df.data_type=='train'].comment.values, 
    add_special_tokens=True, 
    return_attention_mask=True, 
    pad_to_max_length=True, 
    max_length=256, 
    return_tensors='pt'
)

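As is, it runs on the CPU and on only one core. I tried parallelizing, but that only speeds up processing by 16x on my 16-core CPU, so tokenizing the full dataset would still take a very long time.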
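
For reference, one way the parallel attempt described above could be organised is to split the texts into chunks and encode each chunk separately. This is only a sketch: `iter_chunks`, `encode_in_chunks`, and the default chunk size are hypothetical names and values chosen for illustration, and `encode_fn` stands in for whatever tokenizer call is used.

```python
from typing import Callable, Iterator, List, Sequence


def iter_chunks(texts: Sequence[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts` as plain lists of at most `chunk_size` items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(texts), chunk_size):
        yield list(texts[start:start + chunk_size])


def encode_in_chunks(encode_fn: Callable[[List[str]], object],
                     texts: Sequence[str],
                     chunk_size: int = 10_000) -> list:
    """Apply `encode_fn` to each chunk and collect the results in order.

    `encode_fn` is an assumption: any callable that takes a list of strings,
    for example a wrapper around the tokenizer call shown in the question.
    """
    return [encode_fn(chunk) for chunk in iter_chunks(texts, chunk_size)]
```

Each chunk could also be handed to a worker process, for example with `multiprocessing.Pool.map`, which is where a speedup of up to 16x on 16 cores would come from.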

Is there a way to run it on the GPU, or to speed it up some other way?

Edit: I have also tried using the fast tokenizer:

from transformers import BertTokenizerFast

# Create a fast BERT tokenizer
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased',
                                              do_lower_case=True)

Then I call batch_encode_plus on it as before:

#encoding the data using our tokenizer
encoded_dict = tokenizer.batch_encode_plus(
    df[df.data_type=='train'].comment.values, 
    add_special_tokens=True, 
    return_attention_mask=True, 
    pad_to_max_length=True, 
    max_length=256, 
    return_tensors='pt'
)

But batch_encode_plus returns the following error:

TypeError: batch_text_or_text_pairs has to be a list (got <class 'numpy.ndarray'>)
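
The error message suggests that the fast tokenizer only accepts a plain Python list, whereas `df[...].comment.values` is a NumPy array. Below is a minimal sketch of one possible workaround, assuming that is the only cause of the error. `to_text_list` and `encode_comments` are hypothetical helper names, not part of transformers; the arguments to `batch_encode_plus` are copied from the call above.

```python
def to_text_list(values):
    """Return `values` as a plain Python list.

    NumPy arrays and pandas Series both expose `.tolist()`; anything else
    iterable (tuple, generator, list) is passed through `list()`.
    """
    if hasattr(values, "tolist"):
        return values.tolist()
    return list(values)


def encode_comments(tokenizer, values):
    """Hypothetical wrapper: convert the input to a list, then encode it
    with the same arguments used in the question."""
    return tokenizer.batch_encode_plus(
        to_text_list(values),
        add_special_tokens=True,
        return_attention_mask=True,
        pad_to_max_length=True,
        max_length=256,
        return_tensors='pt'
    )
```

With this, the call would become `encode_comments(tokenizer, df[df.data_type=='train'].comment.values)`.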
