c++ - Training a neural network with a validation dataset in FANN

Question

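As some posts suggested, I started using FANN (http://leenissen.dk/fann/index.php) for neural network work. It is clean and easy to understand.

However, to avoid overfitting, I need an algorithm that takes a validation dataset into account as an aid. (What is the difference between training, validation, and test sets in neural networks?) Interestingly, the FANN documentation advises developers to consider the overfitting problem (http://leenissen.dk/fann/wp/help/advanced-usage/).

The problem is that, as far as I can tell, FANN has no feature to support this. The training functions in FANN do not take any argument for passing a validation dataset either. Am I right? How do FANN users train their neural networks with a validation dataset? Thanks for your help.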

score 0 · Accepted Answer

You can implement this approach, i.e. dataset split, with FANN yourself, but you need to train each epoch separately, using the function fann_train_epoch.

You start with a big dataset, which you then want to split for the different steps. The tricky thing is: you split the dataset only once, and use only the first part to adjust the weights (training as such).

Say you already have your two datasets: Train and Validation (as in the example you posted). You first need to store them in different files or arrays. Then you can do the following:

struct fann *ann;
struct fann_train_data *dataTrain;
struct fann_train_data *dataVal;

Assuming that you have both datasets in files:

dataTrain = fann_read_train_from_file("./train.data");
dataVal = fann_read_train_from_file("./val.data");

Then, after setting all network parameters, you train and check the error on the second dataset, one epoch at a time. This is something like:

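/* Needs <float.h> for FLT_MAX; max_epochs is assumed to be set elsewhere. */
unsigned int i;
float train_error, val_error;
float last_val_error = FLT_MAX;  /* so the first epoch never triggers the break */

for (i = 1; i <= max_epochs; i++) {
    fann_train_epoch(ann, dataTrain);              /* one pass over the training set */
    train_error = fann_test_data(ann, dataTrain);  /* MSE on the training set */
    val_error = fann_test_data(ann, dataVal);      /* MSE on the validation set */
    if (val_error > last_val_error)
        break;                                     /* validation error rose: stop */
    last_val_error = val_error;
}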

Of course, this condition is too simple and may stop your training loop too early if the error fluctuates (as it commonly does; see the plot below), but you get the general idea of how to use different datasets during training.
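One common way to make the stopping condition less sensitive to fluctuations is to wait for several epochs without improvement before stopping (often called "patience"). The following is only a sketch of that idea in plain C. The struct early_stop and both functions are hypothetical helpers made up for illustration, not part of FANN, and the choice to count an unchanged error as "no improvement" is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a FANN type): remembers the best validation
 * error seen so far and counts epochs that failed to improve on it. */
struct early_stop {
    float best_error;        /* lowest validation error so far */
    unsigned int patience;   /* non-improving epochs tolerated */
    unsigned int bad_epochs; /* consecutive epochs without improvement */
    int has_best;            /* 0 until the first error is recorded */
};

void early_stop_init(struct early_stop *es, unsigned int patience)
{
    assert(es != NULL);
    es->best_error = 0.0f;
    es->patience = patience;
    es->bad_epochs = 0;
    es->has_best = 0;
}

/* Feed one validation error per epoch.
 * Returns 1 when training should stop, 0 otherwise. */
int early_stop_update(struct early_stop *es, float val_error)
{
    assert(es != NULL);
    if (!es->has_best || val_error < es->best_error) {
        /* New best: remember it and reset the counter. */
        es->best_error = val_error;
        es->has_best = 1;
        es->bad_epochs = 0;
        return 0;
    }
    es->bad_epochs++;
    return es->bad_epochs > es->patience;
}
```

In the loop above, the check on last_val_error would then become something like: if (early_stop_update(&es, val_error)) break; with es initialized once before the loop. Since the weights at the stopping point are no longer the best ones, it may also make sense to save the network (for example with fann_save) whenever a new best error is recorded.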

By the way, you may want to save these errors so that you can plot them against the training epoch and have a look after training has ended:

score 0 · Accepted Answer

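You have to split the data into a training dataset and a cross-validation dataset yourself. You can do this by creating separate input files or by using a built-in function such as fann_subset_train_data (ref).

Once you have the two datasets, you can train your neural network on the training dataset in any way you like. You then get the training error by passing the training data to fann_test_data (ref), and the cross-validation error by passing the cross-validation data to fann_test_data. Note that this function computes the mean squared error (MSE).

Note: users never train their neural network with the cross-validation dataset; the cross-validation dataset is used only for testing!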
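For readers unfamiliar with the metric, the mean squared error is the average of the squared differences between the network outputs and the desired outputs. Below is a minimal sketch in plain C, independent of FANN. The function name mean_squared_error is illustrative, and FANN's own bookkeeping may differ in detail.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative helper, not a FANN function: averages the squared
 * differences between predicted and desired output values. */
float mean_squared_error(const float *outputs, const float *targets,
                         size_t count)
{
    float sum = 0.0f;
    size_t i;

    assert(outputs != NULL && targets != NULL && count > 0);
    for (i = 0; i < count; i++) {
        float diff = outputs[i] - targets[i];
        sum += diff * diff;
    }
    return sum / (float)count;
}
```

Computing this separately on the training data and on the cross-validation data gives the two numbers that are compared during training.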

score 0 · Accepted Answer

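In general, you should use a validation subset (usually 1/5 of the data) for model selection and for choosing the network architecture. The test subset (also 1/5 of the data) is used for reporting the error. This avoids reporting an error that is biased because the same data was used to design the network architecture. You can use the rest of the data for training, but once you have found a model you should plot learning curves to diagnose errors. Doing so may show that you can reduce the training data in order to generalize better rather than overfit. The same can be done for the number of nodes and hidden layers.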
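The 1/5 + 1/5 split described above can be sketched as a small calculation of subset positions and lengths. This is only an illustration: struct split_plan and plan_split are hypothetical names, and placing the training block first, followed by validation and test, is an arbitrary assumption.

```c
#include <assert.h>

/* Hypothetical helper: start position and length of three
 * contiguous subsets of a dataset with `total` samples. */
struct split_plan {
    unsigned int train_pos, train_len;
    unsigned int val_pos, val_len;
    unsigned int test_pos, test_len;
};

/* Reserve total/holdout_div samples for validation and the same
 * number for testing; everything else goes to training. */
struct split_plan plan_split(unsigned int total, unsigned int holdout_div)
{
    struct split_plan p;
    unsigned int holdout;

    assert(holdout_div >= 3); /* otherwise nothing is left for training */
    holdout = total / holdout_div;

    p.val_len = holdout;
    p.test_len = holdout;
    p.train_len = total - 2 * holdout; /* remainder, including rounding leftovers */

    p.train_pos = 0;
    p.val_pos = p.train_len;
    p.test_pos = p.train_len + p.val_len;
    return p;
}
```

With FANN, each (pos, len) pair could then be passed to fann_subset_train_data(data, pos, len), with the total taken from fann_length_train_data(data). Shuffling first with fann_shuffle_train_data is probably wise, so that the subsets do not inherit any ordering present in the file.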

score 0 · Accepted Answer

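There is a training set error and a validation set error. You train over different epochs and batches, and each time you compare the results between training and validation.

When the training error is low and the validation error is high, it means you are overfitting. You need to experiment and iterate until your best model fits your data without overfitting the training set. You may be interested in reading this paper.

Preventing "Overfitting" of Cross-Validation Data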
