I am trying to run inference with a GPT2-like model on a large dataset (26k samples). To speed things up I want to process it in batches, but after trying that, it runs into a CUDA OOM after a few batches. The fact that it only appears after some batches seems strange to me, since I would expect memory usage to stay more or less constant across batches. Here is my code:

tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token

sentences = ["<START_TOK>" + s + "<END_TOK>" + tokenizer.eos_token for s in sentences]

inputs = tokenizer(sentences, return_tensors="pt", padding=True, max_length=1024, truncation=True)

device = torch.device("cuda:0")
inputs = inputs.to(device)
model = model.to(device)
model.eval()
res = []
with torch.no_grad():
    output_sequences = model.generate(
            input_ids=inputs['input_ids'],
            attention_mask=inputs['attention_mask'],
            max_length=1024,
            pad_token_id=tokenizer.eos_token_id,
            no_repeat_ngram_size=2,
            do_sample=True,
            top_k=100,
            top_p=0.9,
            temperature=0.85
        )
    output_sequences = output_sequences.cpu()  # not really sure this is useful, just tried, but the problem remained
    for i in range(len(sentences)):
        res.append(tokenizer.decode(output_sequences[i]))
model.train()
return res
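
For context, the code above is what runs for a single batch; a minimal sketch of the outer loop I have in mind is below (the batch size and the generate_batch name are illustrative placeholders, not my exact code):

batch_size = 32  # illustrative value
res = []
for start in range(0, len(sentences), batch_size):
    batch = sentences[start:start + batch_size]
    # generate_batch is assumed to wrap the tokenize + model.generate code shown above
    res.extend(generate_batch(batch))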

What could the problem be?
