I'm trying to run inference with a GPT-2-like model over a large dataset (26k samples). To speed things up I want to process the data in batches, but after trying this it runs into a CUDA OOM after a few batches. The fact that it only happens after some batches seems strange to me, since I would expect memory usage to stay more or less constant from one batch to the next. Here is my code:
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token
sentences = ["<START_TOK>" + s + "<END_TOK>" + tokenizer.eos_token for s in sentences]
inputs = tokenizer(sentences, return_tensors="pt", padding=True, max_length=1024, truncation=True)
device = torch.device("cuda:0")
inputs = inputs.to(device)
model = model.to(device)
model.eval()
res = []
with torch.no_grad():
    output_sequences = model.generate(
        input_ids=inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        max_length=1024,
        pad_token_id=tokenizer.eos_token_id,
        no_repeat_ngram_size=2,
        do_sample=True,
        top_k=100,
        top_p=0.9,
        temperature=0.85
    )
output_sequences = output_sequences.cpu()  # not really sure this is useful, just tried, but the problem remained
for i in range(len(sentences)):
    res.append(tokenizer.decode(output_sequences[i]))
model.train()
return res
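For context, I call the snippet above once per batch from an outer loop roughly like this (a simplified sketch, not my exact driver code; generate_batch just wraps the code shown, and the batch size and list name are illustrative):

batch_size = 32  # illustrative value
results = []
for start in range(0, len(all_sentences), batch_size):
    batch = all_sentences[start:start + batch_size]  # slice the 26k samples into chunks
    results.extend(generate_batch(batch))  # generate_batch contains the code above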
What could be the problem?