
I am trying to train an image classifier (CIFAR-10) with PySyft. My training setup has 10 workers, each of which gets between 800 and 1,200 images from the dataset.
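For context, the data is loaded locally and distributed across the workers roughly like this (a sketch assuming PySyft 0.2.x; `images`, `labels`, and the uneven per-worker index lists in `splits` are placeholder names for my locally loaded CIFAR-10 tensors):

import torch
import syft as sy

hook = sy.TorchHook(torch)
workers = [sy.VirtualWorker(hook, id=f"worker{i}") for i in range(10)]

# splits holds one index list per worker (800-1200 indices each)
remote_datasets = [
    sy.BaseDataset(images[idx], labels[idx]).send(worker)
    for worker, idx in zip(workers, splits)
]
f_dataloader = sy.FederatedDataLoader(
    sy.FederatedDataset(remote_datasets), batch_size=BATCH_SIZE, shuffle=True)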

My problem is that after roughly 250-300 epochs the training loss sits at about 0.005 and the model stops improving, even though test accuracy is only around 45% and the test loss keeps growing from 1.5 to 8.5. I tried the same with 100 workers at 500 images each, and accuracy plateaued at 32%. The implementation is part of a comparison between models and FL frameworks, so the model cannot be changed, and the data is loaded locally and wrapped in a DataLoader. I am quite inexperienced with PyTorch and PySyft, so I may well have made a mistake in the training loop, although I tried to stay as close as possible to the official example.

I trained the model without PySyft and it reached about 85%, so I don't think my data loaders or the model are the problem. To me it looks as if the workers are overfitting to their own local data during training.

Is there a way to prevent the workers from overfitting, or to compute the loss against the global model rather than against each worker's local copy?
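What I have in mind is something like averaging the worker copies into a global model each round (FedAvg-style). A minimal plain-PyTorch sketch of the averaging step, assuming one trained copy per worker in `worker_models`, would be:

import torch

def average_models(global_model, worker_models):
    # element-wise mean of every parameter/buffer across the worker copies
    global_state = global_model.state_dict()
    for key in global_state:
        global_state[key] = torch.stack(
            [m.state_dict()[key].float() for m in worker_models]).mean(dim=0)
    global_model.load_state_dict(global_state)
    return global_model

But I don't know how to express this with PySyft's send/get pointer mechanics.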

Trainer:

    
import time

import torch
import torch.nn as nn
import torch.nn.functional as F


def fl_train(args, model, device, federated_train_loader, optimizer, epoch, log):
    model.train()
    results = []
    metrics = []
    t1 = time.time()
    cel = nn.CrossEntropyLoss()  # note: unused, F.nll_loss is applied below
    for batch_idx, (data, target) in enumerate(federated_train_loader): # <-- now it is a distributed dataset
        t2 = time.time()
        model.send(data.location) # <-- NEW: send the model to the right location
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target.long())
        loss.backward()
        optimizer.step()
        model.get() # <-- NEW: get the model back
        if batch_idx % args.log_interval == 0:
            loss = loss.get() # <-- NEW: get the loss back
            results.append(loss.item())
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * BATCH_SIZE, len(federated_train_loader) * BATCH_SIZE,
                100. * batch_idx / len(federated_train_loader), loss.item()))
    return log  # <-- added: main assigns the return value back to log

Model:

class CNN(nn.Module):

    def __init__(self):
        super(CNN, self).__init__()

        self.conv_layer = nn.Sequential(

            # Conv Layer block 1
            nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3),
            nn.ReLU(inplace=True),
            nn.MaxPool2d((2,2)),

            # Conv Layer block 2
            nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3),
            nn.ReLU(inplace=True),
            nn.MaxPool2d((2,2)),

            # Conv Layer block 3
            nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3),
            nn.ReLU(inplace=True),
        )

        self.fc_layer = nn.Sequential(
            nn.Linear(1024, 64),
            nn.ReLU(inplace=True),
            nn.Linear(64, 10)
        )


    def forward(self, x):
        # CNN layers
        x = self.conv_layer(x)

        # flatten
        x = x.view(-1, 1024)

        # NN layer
        x = self.fc_layer(x)
        return F.log_softmax(x, dim=1)
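As a dimension check: for 32x32 CIFAR-10 inputs the feature maps shrink to 30 -> 15 after block 1, 13 -> 6 after block 2, and 4x4 after block 3, so the flattened size is 64 * 4 * 4 = 1024, matching the first Linear layer.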

Main:

model = CNN().to(device)
optimizer = optim.SGD(model.parameters(), lr=0.02) # TODO momentum is not supported at the moment
log = {}
for epoch in range(1, args.epochs + 1):
    log = fl_train(args, model, device, f_dataloader, optimizer, epoch, log)
    if epoch % 20 == 0:
        log = test(args, model, device, test_loader, epoch, log)
    if epoch % 100 == 0:
        store_results(log, model)
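Here `test` is essentially the standard PyTorch evaluation loop run on the retrieved global model; a simplified sketch (with the `log` bookkeeping omitted) looks like this:

def test(args, model, device, test_loader, epoch, log):
    model.eval()
    test_loss, correct = 0, 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            # sum up batch loss, then count correct predictions
            test_loss += F.nll_loss(output, target.long(), reduction='sum').item()
            correct += output.argmax(dim=1).eq(target).sum().item()
    test_loss /= len(test_loader.dataset)
    print('Test set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))
    return log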

Log:

....
Train Epoch: 317 [0/10400 (0%)] Loss: 0.005194
Train Epoch: 317 [3000/10400 (29%)] Loss: 0.003882
Train Epoch: 317 [6000/10400 (58%)] Loss: 0.003100
Train Epoch: 317 [9000/10400 (87%)] Loss: 0.004298
Train Epoch: 318 [0/10400 (0%)] Loss: 0.007426
Train Epoch: 318 [3000/10400 (29%)] Loss: 0.002255
Train Epoch: 318 [6000/10400 (58%)] Loss: 0.003835
Train Epoch: 318 [9000/10400 (87%)] Loss: 0.005277
Train Epoch: 319 [0/10400 (0%)] Loss: 0.006207
Train Epoch: 319 [3000/10400 (29%)] Loss: 0.003562
Train Epoch: 319 [6000/10400 (58%)] Loss: 0.001904
Train Epoch: 319 [9000/10400 (87%)] Loss: 0.002644
Train Epoch: 320 [0/10400 (0%)] Loss: 0.007491
Train Epoch: 320 [3000/10400 (29%)] Loss: 0.003794
Train Epoch: 320 [6000/10400 (58%)] Loss: 0.002643
Train Epoch: 320 [9000/10400 (87%)] Loss: 0.002981
Test set: Average loss: 9.1279, Accuracy: 458/1000 (46%)

Train Epoch: 321 [0/10400 (0%)] Loss: 0.007153
Train Epoch: 321 [3000/10400 (29%)] Loss: 0.004265
Train Epoch: 321 [6000/10400 (58%)] Loss: 0.002708
Train Epoch: 321 [9000/10400 (87%)] Loss: 0.002518
Train Epoch: 322 [0/10400 (0%)] Loss: 0.006285
Train Epoch: 322 [3000/10400 (29%)] Loss: 0.002357
Train Epoch: 322 [6000/10400 (58%)] Loss: 0.002465
Train Epoch: 322 [9000/10400 (87%)] Loss: 0.002406
Train Epoch: 323 [0/10400 (0%)] Loss: 0.005361
Train Epoch: 323 [3000/10400 (29%)] Loss: 0.004807
Train Epoch: 323 [6000/10400 (58%)] Loss: 0.001903
Train Epoch: 323 [9000/10400 (87%)] Loss: 0.003711
Train Epoch: 324 [0/10400 (0%)] Loss: 0.006609
....
