computer-vision - 使用 lmdb 进行 caffe 多标签训练以对面部区域进行分类

Question

我正在使用两个 lmdb 输入来识别脸部的眼睛、鼻尖和嘴巴区域。数据 lmdb 的维度为Nx3x Hx W，而标签 lmdb 的维度为Nx1x H/4x W/4。标签图像是通过在初始化为全 0 的 opencv Mat 上使用数字 1-4 屏蔽区域来创建的（因此总共有 5 个标签，其中 0 是背景标签）。我将标签图像缩小为相应图像的宽度和高度的 1/4，因为我的网络中有 2 个池化层。这种缩小可确保标签图像尺寸与最后一个卷积层的输出相匹配。

我的 train_val.prototxt：

name: "facial_keypoints"
layer {
name: "images"
type: "Data"
top: "images"
include {
phase: TRAIN
}
transform_param {
mean_file: "../mean.binaryproto"
}
data_param {
source: "../train_lmdb"
batch_size: 100
backend: LMDB
}
}
layer {
name: "labels"
type: "Data"
top: "labels"
include {
phase: TRAIN
}
data_param {
source: "../train_label_lmdb"
batch_size: 100
backend: LMDB
}
}
layer {
name: "images"
type: "Data"
top: "images"
include {
phase: TEST
}
transform_param {
mean_file: "../mean.binaryproto"
}
data_param {
source: "../test_lmdb"
batch_size: 100
backend: LMDB
}
}
layer {
name: "labels"
type: "Data"
top: "labels"
include {
phase: TEST
}
data_param {
source: "../test_label_lmdb"
batch_size: 100
backend: LMDB
}
}
layer {
name: "conv1"
type: "Convolution"
bottom: "images"
top: "conv1"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 32
pad: 2
kernel_size: 5
stride: 1
weight_filler {
type: "gaussian"
std: 0.0001
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "pool1"
type: "Pooling"
bottom: "conv1"
top: "pool1"
pooling_param {
pool: MAX
kernel_size: 3
stride: 2
}
}
layer {
name: "relu1"
type: "ReLU"
bottom: "pool1"
top: "pool1"
}
layer {
name: "conv2"
type: "Convolution"
bottom: "pool1"
top: "conv2"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 64
pad: 2
kernel_size: 5
stride: 1
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "relu2"
type: "ReLU"
bottom: "conv2"
top: "conv2"
}
layer {
name: "pool2"
type: "Pooling"
bottom: "conv2"
top: "pool2"
pooling_param {
pool: AVE
kernel_size: 3
stride: 2
}
}
layer {
name: "conv_last"
type: "Convolution"
bottom: "pool2"
top: "conv_last"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 5
pad: 2
kernel_size: 5
stride: 1
weight_filler {
#type: "xavier"
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "relu2"
type: "ReLU"
bottom: "conv_last"
top: "conv_last"
}

layer {
name: "accuracy"
type: "Accuracy"
bottom: "conv_last"
bottom: "labels"
top: "accuracy"
include {
phase: TEST
}
}
layer {
name: "loss"
type: "SoftmaxWithLoss"
bottom: "conv_last"
bottom: "labels"
top: "loss"
}

在最后一个卷积层中，我将输出大小设置为 5，因为我有 5 个标签类。训练收敛，最终损失约为 0.3，准确度为 0.9（尽管一些消息来源表明这种准确度没有正确测量多标签）。使用经过训练的模型时，输出层正确地生成了一个尺寸为 1x5x H/4x W/4 的 blob，我设法将其可视化为 5 个单独的单通道图像。然而，虽然第一张图像正确突出了背景像素，但其余 4 张图像看起来几乎相同，所有 4 个区域都突出显示。

5 个输出通道的可视化（强度从蓝色增加到红色）：
点击查看图片

原始图像（同心圆标记了每个通道的最高强度。有些更大只是为了与其他通道区分开来。正如您所看到的，除了背景之外，其余通道几乎在同一个嘴巴区域具有最高的激活，这不应该是这种情况。 )
点击查看图片

有人可以帮我找出我犯的错误吗？

谢谢。

score 0 · Accepted Answer

似乎您正面临类别不平衡：您的大多数标记像素都标记为 0（背景），因此，在训练期间，网络几乎不管它“看到”什么都学会了预测背景。由于大多数时候预测背景是正确的，因此训练损失会减少，并且准确度会提高到某个点。
但是，当您实际尝试可视化输出预测时，它主要是背景，而关于其他稀缺标签的信息很少。

在 caffe 中，解决类别不平衡的一种方法是使用“InfogainLoss”层，其权重已调整以抵消标签的不平衡。

computer-vision - 使用 lmdb 进行 caffe 多标签训练以对面部区域进行分类

1 回答 1

Related

Reference