I am trying to run code from GitHub. The file is named train.py, and it is supposed to train a neural network on a dataset. However, I get the following error:

(QGN) ubuntu@ip-172-31-13-114:~/QGN$ python train.py
Input arguments:
id               ade20k
arch_encoder     resnet50
arch_decoder     QGN_dense_resnet34
weights_encoder
weights_decoder
fc_dim           2048
list_train       ./data/train_ade20k.odgt
list_val         ./data/validation_ade20k.odgt
root_dataset     ./data/
num_gpus         0
batch_size_per_gpu 2
num_epoch        20
start_epoch      1
epoch_iters      5000
optim            SGD
lr_encoder       0.02
lr_decoder       0.02
lr_pow           0.9
beta1            0.9
weight_decay     0.0001
deep_sup_scale   1.0
prop_weight      2.0
enhance_weight   2.0
fix_bn           0
num_val          500
num_class        150
transform_dict   None
workers          40
imgSize          [300, 375, 450, 525, 600]
imgMaxSize       1000
cropSize         0
padding_constant 32
random_flip      True
seed             1337
ckpt             ./ckpt
disp_iter        20
visualize        False
result           ./result
gpu_id           0
Model ID: ade20k-resnet50-QGN_dense_resnet34-batchSize0-LR_encoder0.02-LR_decoder0.02-epoch20-lossScale1.0-classScale2.0
# samples: 20210
1 Epoch = 5000 iters
Starting Training!
Traceback (most recent call last):
  File "train.py", line 355, in <module>
    main(args)
  File "train.py", line 217, in main
    train(segmentation_module, iterator_train, optimizers, history, epoch, args)
  File "train.py", line 33, in train
    batch_data = next(iterator)
  File "/home/ubuntu/QGN/lib/utils/data/dataloader.py", line 274, in __next__
    raise StopIteration
StopIteration
Segmentation fault (core dumped)

The code in train.py (lines 211 to 231) is as follows '''

# Main loop

history = {'train': {'epoch': [], 'loss': [], 'acc': []}}

print('Starting Training!')

for epoch in range(args.start_epoch, args.num_epoch + 1):
    train(segmentation_module, iterator_train, optimizers, history, epoch, args)

    # checkpointing
    checkpoint(nets, history, args, epoch)

    # evaluation
    args.weights_encoder = os.path.join(args.ckpt, 'encoder_epoch_' + str(epoch) + '.pth')
    args.weights_decoder = os.path.join(args.ckpt, 'decoder_epoch_' + str(epoch) + '.pth')
    iou = eval_train(args)

    # adaptive class weighting
    adjust_crit_weights(segmentation_module, iou, args)


print('Training Done!')

'''

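I am not sure whether I have shared all the required information. I would appreciate any help in resolving this issue. For the record, I have already tried the try/except approach shared on GitHub at https://github.com/amdegroot/ssd.pytorch/issues/214, but the error persists.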

The code at line 30 of train.py is as follows:

    # main loop
    tic = time.time()
    for i in range(args.epoch_iters):
        batch_data = next(iterator)
        data_time.update(time.time() - tic)

        segmentation_module.zero_grad()

I modified the code above as follows:

    # main loop
    loader_train = torchdata.DataLoader(
        dataset_train,
        batch_size=args.num_gpus,  # we have modified data_parallel
        shuffle=False,  # we do not use this param
        collate_fn=user_scattered_collate,num_workers=int(args.workers),
        drop_last=True,
        pin_memory=True)


    tic = time.time()
    for i in range(args.epoch_iters):
        try:
            batch_data = next(iterator)
        except StopIteration:
            iterator = iter(loader_train)
            batch_data = next(iterator)
        data_time.update(time.time() - tic)

        segmentation_module.zero_grad()

But still no joy. The error persists.


1 Answer


TL;DR
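The value of args.epoch_iters is larger than the number of batches in loader_train. Python raises StopIteration when you ask an iterator for more batches than it actually has.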
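To make that concrete, here is a minimal sketch that reproduces the behaviour without PyTorch. The list `batches` and the helper `take_batches` are hypothetical stand-ins for loader_train and the loop in train(); they are not taken from the QGN code.

```python
def take_batches(batches, n_iters):
    """Call next() n_iters times on a fresh iterator over `batches`."""
    iterator = iter(batches)
    fetched = []
    for _ in range(n_iters):
        # next() raises StopIteration as soon as the iterator is exhausted
        fetched.append(next(iterator))
    return fetched


# Stand-in for a loader that only has 3 batches
batches = ["batch0", "batch1", "batch2"]

# Asking for exactly as many batches as exist works
ok = take_batches(batches, 3)

# Asking for more batches than exist fails, like epoch_iters > len(loader)
try:
    take_batches(batches, 5)
    raised = False
except StopIteration:
    raised = True
```

With a real DataLoader, comparing len(loader_train) against args.epoch_iters before training should show whether this is the situation you are in.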

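When you iterate over a Python collection of elements (e.g., a list, a tuple, a DataLoader...), Python needs to know when it has reached the end of that collection. This is done by raising a StopIteration exception. A for loop in Python listens for this exception and uses it to know when to stop. Alas, in your code you are not using a for loop over loader_train. Instead, you loop over range(args.epoch_iters) and call next(iterator) to fetch the batches, so nothing catches the exception when the iterator runs out.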
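If you need a fixed number of iterations per epoch regardless of the loader length, one common pattern is to restart the iterator whenever it is exhausted. The sketch below uses a plain list in place of loader_train, and the helper name `iterate_fixed_steps` is hypothetical. Note the inner check: if the loader yields no batches at all, restarting it cannot help, and next() would raise StopIteration again.

```python
def iterate_fixed_steps(loader, n_iters):
    """Yield exactly n_iters batches, restarting `loader` when it runs out."""
    iterator = iter(loader)
    for _ in range(n_iters):
        try:
            batch = next(iterator)
        except StopIteration:
            # The loader is exhausted: start a new pass over it
            iterator = iter(loader)
            try:
                batch = next(iterator)
            except StopIteration:
                # A fresh iterator that is already empty means no data at all
                raise ValueError("loader yields no batches at all") from None
        yield batch


# Stand-in for a loader with 3 batches, iterated for 7 steps
loader = [10, 20, 30]
result = list(iterate_fixed_steps(loader, 7))

# An empty loader is reported explicitly instead of failing obscurely
try:
    list(iterate_fixed_steps([], 1))
    empty_error = False
except ValueError:
    empty_error = True
```

In train(), this would amount to writing `for batch_data in iterate_fixed_steps(loader_train, args.epoch_iters):`, assuming loader_train is accessible there. If the empty-loader error fires, it is worth checking the batch size and drop_last settings used to build the loader.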

Answered 2021-10-11T05:22:41.723