pytorch - CUDA out of memory when fine-tuning a large model

Question

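I previously trained a VGG model (say model1) and a two-layer model (say model2) separately. Now I have to train a new model that combines the two, with each part initialized from the learned weights of model1 and model2. My implementation is as follows: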

class TransferModel(nn.Module):
    def __init__(self, VGG, TwoLayer):
        super(TransferModel, self).__init__()
        self.vgg_layer=VGG
        self.linear = TwoLayer
        for param in self.vgg_layer.parameters():
            param.requires_grad = True
    def forward(self, x):
        h1_vgg = self.vgg_layer(x)
        y_pred = self.linear(h1_vgg)
        return y_pred
# for image_id in train_ids[0:1]:
#     img = load_image(train_id_to_file[image_id])
new_model=TransferModel(trained_vgg_instance, trained_twolayer_instance)
new_model.linear.load_state_dict(trained_twolayer_instance.state_dict())
new_model.vgg_layer.load_state_dict(trained_vgg_instance.state_dict())
new_model.cuda()

For training, I tried:

def train(model, learning_rate=0.001, batch_size=50, epochs=2):
    optimizer=optim.Adam(model.parameters(), lr=learning_rate)
    criterion = torch.nn.MultiLabelSoftMarginLoss()
    x = torch.zeros([batch_size, 3, img_size, img_size])
    y_true = torch.zeros([batch_size, 4096])
    for epoch in range(epochs):  # loop over the dataset multiple times
        running_loss = 0.0
        shuffled_indcs=torch.randperm(20000)
        for i in range(20000):
        for batch_num in range(int(20000/batch_size)):
            optimizer.zero_grad()
            for j in range(batch_size):
                # ... some code to load batches of images into x....
            x_batch=Variable(x).cuda()
            print(batch_num)
            y_true_batch=Variable(train_labels[batch_num*batch_size:(batch_num+1)*batch_size, :]).cuda()
            y_pred =model(x_batch)
            loss = criterion(y_pred, y_true_batch)
            loss.backward()
            optimizer.step()
            running_loss += loss
            del x_batch, y_true_batch, y_pred
            torch.cuda.empty_cache()
        print("in epoch[%d] = %.8f " % (epoch, running_loss /(batch_num+1)))
        running_loss = 0.0

    print('Finished Training')
train(new_model)

On the second iteration of the first epoch (batch_num=1), I get this error:

CUDA out of memory. Tried to allocate 153.12 MiB (GPU 0; 5.93 GiB total capacity; 4.83 GiB already allocated; 66.94 MiB free; 374.12 MiB cached)

Even though I explicitly use "del" in the training loop, nvidia-smi suggests it has no effect and the memory is not freed.

What should I do?

score 0 · Accepted Answer

Change this line:

running_loss += loss

to this:

running_loss += loss.item()

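By adding loss to running_loss, you are telling PyTorch to keep the autograd history of loss for that batch in memory, even after you start training on the next batch. PyTorch assumes you may later want to use running_loss in some larger loss function spanning multiple batches, so it keeps the graph (and therefore the activations) of every batch in memory.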

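By adding .item(), you get the loss as a plain Python float instead of a torch.FloatTensor. This float is detached from the PyTorch graph, so PyTorch knows you do not want gradients with respect to it.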
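The difference can be checked on a small CPU example. This is only a minimal sketch: the single linear layer and the random inputs are hypothetical stand-ins for the model and data in the question, not the asker's actual setup.

```python
import torch

# Hypothetical stand-in for the model in the question: one linear layer.
model = torch.nn.Linear(4, 2)
criterion = torch.nn.MultiLabelSoftMarginLoss()

x = torch.randn(8, 4)
y_true = torch.zeros(8, 2)

loss = criterion(model(x), y_true)

# Accumulating the tensor itself: the running total becomes a tensor
# that is still connected to the graph that produced `loss`.
tensor_total = 0.0
tensor_total += loss

# Accumulating loss.item(): the running total stays a plain Python float.
float_total = 0.0
float_total += loss.item()

print(type(tensor_total), tensor_total.requires_grad)
print(type(float_total))
```

As long as tensor_total is alive, every batch added to it keeps its graph reachable, which is why memory grows from one iteration to the next.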

If you are running an older version of PyTorch that does not have .item(), you can try:

running_loss += float(loss.detach().cpu())

A similar mistake in your test() loop, if you have one, could also cause this.
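For the evaluation side, one possible shape of such a loop is sketched below. The evaluate function, the toy model, and the toy batches are hypothetical and are not taken from the question. Wrapping the loop in torch.no_grad() is an addition that the answer above does not mention, and it assumes PyTorch 0.4 or later.

```python
import torch

def evaluate(model, batches, criterion):
    """Return the mean loss over `batches` without recording autograd graphs."""
    model.eval()
    total = 0.0
    count = 0
    with torch.no_grad():                 # no graph is recorded inside this block
        for x_batch, y_batch in batches:
            loss = criterion(model(x_batch), y_batch)
            total += loss.item()          # accumulate a float, not a tensor
            count += 1
    return total / max(count, 1)

# Hypothetical toy model and data, for illustration only.
toy_model = torch.nn.Linear(4, 2)
toy_batches = [(torch.randn(8, 4), torch.zeros(8, 2)) for _ in range(3)]
mean_loss = evaluate(toy_model, toy_batches, torch.nn.MultiLabelSoftMarginLoss())
print(type(mean_loss))
```

Either measure alone (no_grad or .item()) should avoid retaining per-batch graphs during evaluation; using both is a defensive choice.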
