
I have two datasets, and I am training a CNN on each of them with the Caffe library.

The first dataset is large: more than 60,000 training images and 16,000 test images. Its solver file is shown below. The training batch size is set to 32.

train_net: "/home/Softwares/Projects/caffe-ssd-2/NumberPlate/InceptionNet/6/train_0.prototxt"
test_net: "/home/Softwares/Projects/caffe-ssd-2/NumberPlate/InceptionNet/6/test_0.prototxt"
test_iter: 2080
test_interval: 4000
base_lr: 0.0010000000475
display: 10
max_iter: 16000
lr_policy: "multistep"
gamma: 0.10000000149
momentum: 0.899999976158
weight_decay: 0.000500000023749
snapshot: 2000
snapshot_prefix: "/home/Softwares/Projects/caffe-ssd-2/NumberPlate/InceptionNet/6/InceptionNet"
solver_mode: GPU
device_id: 0
debug_info: false
snapshot_after_train: true
test_initialization: false
average_loss: 10
stepvalue: 4000
stepvalue: 8000
stepvalue: 12000
iter_size: 1
momentum2: 0.999000012875
type: "Adam"
eval_type: "detection"
ap_version: "11point"
num_total_train_images: 62308
pathtolog: "/home/Softwares/Projects/caffe-ssd-2/NumberPlate/InceptionNet/6"
batchsize: 32
meanprecision: 0.5
scratch: 1

The second dataset has far fewer training images: only 2,883 training images and 709 test images. The training batch size is set to 16, as shown below.

train_net: "/home/Softwares/Projects/caffe-ssd-2/Nextan/InceptionNet/0/train_0.prototxt"
test_net: "/home/Softwares/Projects/caffe-ssd-2/Nextan/InceptionNet/0/test_0.prototxt"
test_iter: 177
test_interval: 500
base_lr: 0.0010000000475
display: 10
max_iter: 8000
lr_policy: "multistep"
gamma: 0.10000000149
momentum: 0.899999976158
weight_decay: 0.000500000023749
snapshot: 1000
snapshot_prefix: "/home/Softwares/Projects/caffe-ssd-2/Nextan/InceptionNet/0/InceptionNet"
solver_mode: GPU
device_id: 0
debug_info: false
snapshot_after_train: true
test_initialization: false
average_loss: 10
stepvalue: 2000
stepvalue: 4000
stepvalue: 6000
iter_size: 1
momentum2: 0.999000012875
type: "Adam"
eval_type: "detection"
ap_version: "11point"
num_total_train_images: 2883
pathtolog: "/home/Softwares/Projects/caffe-ssd-2/Nextan/InceptionNet/0"
batchsize: 16
meanprecision: 0.5
scratch: 1

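I trained both on the same PC, with the same GPU and resources. Training on the second dataset fails with "Check failed: error == cudaSuccess (74 vs. 0) misaligned address", while the first dataset trains successfully. What could be wrong?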


1 Answer


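This is an internal error in Caffe: in some cases max_workspace is not a multiple of 16, which leaves the workspace misaligned in memory. The first thing I would try is changing the batch size.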

Here you can see the pull request in question: https://github.com/BVLC/caffe/pull/6548
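
To illustrate why a workspace size that is not a multiple of 16 can matter, here is a minimal Python sketch. It is not Caffe's code: the function names (align_up, chunk_offsets) and the byte counts are hypothetical, and it assumes that one buffer is split into equal chunks whose start offsets are multiples of the per-chunk size. Under that assumption, an unaligned chunk size puts every chunk after the first at a misaligned address, and rounding the size up to a multiple of 16 avoids it.

```python
def align_up(size_in_bytes: int, alignment: int = 16) -> int:
    """Round size_in_bytes up to the next multiple of alignment."""
    if alignment <= 0:
        raise ValueError("alignment must be positive")
    return (size_in_bytes + alignment - 1) // alignment * alignment


def chunk_offsets(chunk_size: int, num_chunks: int) -> list:
    """Start offsets when one buffer is split into equal-sized chunks."""
    return [i * chunk_size for i in range(num_chunks)]


def misaligned(offsets, alignment: int = 16) -> list:
    """Return the offsets that are not multiples of alignment."""
    return [off for off in offsets if off % alignment != 0]


# Hypothetical per-chunk workspace size in bytes (not taken from Caffe).
raw_size = 100
print(misaligned(chunk_offsets(raw_size, 3)))            # unaligned chunk size
print(misaligned(chunk_offsets(align_up(raw_size), 3)))  # size rounded up to 16
```

Since the workspace size depends on the layer input shapes, this would also be consistent with the batch size affecting whether the error appears, although the exact dependency is specific to Caffe and cuDNN.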

Answered 2019-06-13T10:41:27.207