I am trying to train an image classifier (CIFAR-10) with PySyft. My training setup has 10 workers, each of which receives 800 to 1200 images of the dataset.
My problem is that after roughly 250-300 epochs, with the training loss at about 0.005, the model stops improving, even though test accuracy is only around 45% and the test loss climbs from 1.5 to 8.5. I tried the same with 100 workers at 500 images each and that run plateaued at 32%. The implementation is part of a comparison between models and FL frameworks, so the model cannot be changed, and the data is loaded locally and wrapped in a DataLoader. I am quite inexperienced with PyTorch and PySyft, so I may have made a mistake in the training loop, although I tried to stay as close as possible to the example.
I trained the model without PySyft and it reached about 85%, so I don't think my data loaders or the model are the problem. To me it looks like the workers are overfitting their own local data during training.
Is there a way to keep the workers from overfitting, or to compute the loss of the global model instead of each worker's local loss? One idea I have is periodic weight averaging; see the sketch below.
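A minimal sketch in plain PyTorch of the averaging I have in mind, assuming one model copy is trained per worker and then merged; `average_state_dicts` and `train_locally` are my own hypothetical names, not PySyft API:

import copy
import torch

def average_state_dicts(state_dicts):
    # hypothetical helper: element-wise mean of all workers' parameters (FedAvg-style)
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        avg[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return avg

# hypothetical round: each worker trains its own copy, then the copies are merged
# local_models = [train_locally(copy.deepcopy(global_model), w) for w in workers]
# global_model.load_state_dict(average_state_dicts([m.state_dict() for m in local_models]))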
Trainer:
def fl_train(args, model, device, federated_train_loader, optimizer, epoch, log):
    model.train()
    results = []
    t1 = time.time()
    for batch_idx, (data, target) in enumerate(federated_train_loader):  # <-- now it is a distributed dataset
        model.send(data.location)  # <-- NEW: send the model to the right location
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        # F.nll_loss pairs with the log_softmax output of the model
        loss = F.nll_loss(output, target.long())
        loss.backward()
        optimizer.step()
        model.get()  # <-- NEW: get the model back
        if batch_idx % args.log_interval == 0:
            loss = loss.get()  # <-- NEW: get the loss back
            results.append(loss.item())
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * BATCH_SIZE, len(federated_train_loader) * BATCH_SIZE,
                100. * batch_idx / len(federated_train_loader), loss.item()))
    return log  # main reassigns log from this return, so it must be passed back
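For completeness, the federated loader is built roughly like this (a sketch following the PySyft 0.2.x tutorial API; the images/labels tensors stand in for my local CIFAR-10 loading):

import torch
import syft as sy

hook = sy.TorchHook(torch)  # hook PyTorch so tensors can live on remote workers
workers = [sy.VirtualWorker(hook, id="worker{}".format(i)) for i in range(10)]

# images: FloatTensor [N, 3, 32, 32], labels: LongTensor [N], loaded locally
federated_dataset = sy.BaseDataset(images, labels).federate(workers)
f_dataloader = sy.FederatedDataLoader(federated_dataset, batch_size=BATCH_SIZE, shuffle=True)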
Model:
class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        self.conv_layer = nn.Sequential(
            # Conv Layer block 1
            nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3),
            nn.ReLU(inplace=True),
            nn.MaxPool2d((2, 2)),
            # Conv Layer block 2
            nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3),
            nn.ReLU(inplace=True),
            nn.MaxPool2d((2, 2)),
            # Conv Layer block 3
            nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3),
            nn.ReLU(inplace=True),
        )
        self.fc_layer = nn.Sequential(
            nn.Linear(1024, 64),
            nn.ReLU(inplace=True),
            nn.Linear(64, 10)
        )

    def forward(self, x):
        # CNN layers
        x = self.conv_layer(x)
        # flatten
        x = x.view(-1, 1024)
        # NN layer
        x = self.fc_layer(x)
        return F.log_softmax(x, dim=1)
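As a sanity check on the hard-coded flatten size: with a 32x32 CIFAR-10 input, the conv stack produces 64 x 4 x 4 = 1024 features (spatial dims 32 -> 30 -> 15 -> 13 -> 6 -> 4), which matches the nn.Linear(1024, 64) input:

import torch
x = torch.randn(1, 3, 32, 32)
print(CNN().conv_layer(x).shape)  # torch.Size([1, 64, 4, 4]) -> 64 * 4 * 4 = 1024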
Main:
model = CNN().to(device)
optimizer = optim.SGD(model.parameters(), lr=0.02)  # TODO momentum is not supported at the moment
log = {}
for epoch in range(1, args.epochs + 1):
    log = fl_train(args, model, device, f_dataloader, optimizer, epoch, log)
    if epoch % 20 == 0:
        log = test(args, model, device, test_loader, epoch, log)
    if epoch % 100 == 0:
        store_results(log, model)
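The test function referenced above evaluates the retrieved global model centrally on a held-out test set; mine is essentially the standard PyTorch evaluation loop (no PySyft involved), roughly:

def test(args, model, device, test_loader, epoch, log):
    model.eval()
    test_loss, correct = 0.0, 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction='sum').item()  # sum up batch losses
            correct += output.argmax(dim=1).eq(target).sum().item()
    test_loss /= len(test_loader.dataset)
    print('Test set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))
    return log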
Log:
....
Train Epoch: 317 [0/10400 (0%)] Loss: 0.005194
Train Epoch: 317 [3000/10400 (29%)] Loss: 0.003882
Train Epoch: 317 [6000/10400 (58%)] Loss: 0.003100
Train Epoch: 317 [9000/10400 (87%)] Loss: 0.004298
Train Epoch: 318 [0/10400 (0%)] Loss: 0.007426
Train Epoch: 318 [3000/10400 (29%)] Loss: 0.002255
Train Epoch: 318 [6000/10400 (58%)] Loss: 0.003835
Train Epoch: 318 [9000/10400 (87%)] Loss: 0.005277
Train Epoch: 319 [0/10400 (0%)] Loss: 0.006207
Train Epoch: 319 [3000/10400 (29%)] Loss: 0.003562
Train Epoch: 319 [6000/10400 (58%)] Loss: 0.001904
Train Epoch: 319 [9000/10400 (87%)] Loss: 0.002644
Train Epoch: 320 [0/10400 (0%)] Loss: 0.007491
Train Epoch: 320 [3000/10400 (29%)] Loss: 0.003794
Train Epoch: 320 [6000/10400 (58%)] Loss: 0.002643
Train Epoch: 320 [9000/10400 (87%)] Loss: 0.002981
Test set: Average loss: 9.1279, Accuracy: 458/1000 (46%)
Train Epoch: 321 [0/10400 (0%)] Loss: 0.007153
Train Epoch: 321 [3000/10400 (29%)] Loss: 0.004265
Train Epoch: 321 [6000/10400 (58%)] Loss: 0.002708
Train Epoch: 321 [9000/10400 (87%)] Loss: 0.002518
Train Epoch: 322 [0/10400 (0%)] Loss: 0.006285
Train Epoch: 322 [3000/10400 (29%)] Loss: 0.002357
Train Epoch: 322 [6000/10400 (58%)] Loss: 0.002465
Train Epoch: 322 [9000/10400 (87%)] Loss: 0.002406
Train Epoch: 323 [0/10400 (0%)] Loss: 0.005361
Train Epoch: 323 [3000/10400 (29%)] Loss: 0.004807
Train Epoch: 323 [6000/10400 (58%)] Loss: 0.001903
Train Epoch: 323 [9000/10400 (87%)] Loss: 0.003711
Train Epoch: 324 [0/10400 (0%)] Loss: 0.006609
....