
I am using TrainTestSplit in ML.NET to repeatedly split my data set into a training set and a test set. In sklearn, for example, the corresponding function takes a seed as input, so different seeds give different splits. In ML.NET, however, repeated calls to TrainTestSplit seem to return the same split. Is it possible to change the random seed used by TrainTestSplit?


2 Answers


Right now TrainTestSplit doesn't take a random seed. There is an open issue in ML.NET to fix this: https://github.com/dotnet/machinelearning/issues/1635

As a short-term workaround, I recommend manually adding a random column to the data view, and using it as a stratificationColumn in TrainTestSplit:

    // Add a column of pseudo-random numbers to the data view.
    data = new GenerateNumberTransform(mlContext, new GenerateNumberTransform.Arguments
    {
        Column = new[] { new GenerateNumberTransform.Column { Name = "random" } },
        Seed = 42 // change seed to get a different split
    }, data);

    // Split on the random column, so the split depends on the seed above.
    (var train, var test) = mlContext.Regression.TrainTestSplit(data, stratificationColumn: "random");

This code works with ML.NET 0.7, and we will add seed support to TrainTestSplit in 0.8.

Answered 2018-11-18T04:11:24.673

As of today (ML.NET v1.0), this has been solved. TrainTestSplit takes a seed as input, and it also supports stratification by setting samplingKeyColumnName:

TrainTestSplit(IDataView data, double testFraction = 0.1, string samplingKeyColumnName = null, Nullable<int> seed = null);
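
As a rough illustration of the signature above, the sketch below calls TrainTestSplit with several seeds on a small in-memory data set. It assumes ML.NET 1.0 or later, where the method is reached through mlContext.Data and returns an object with TrainSet and TestSet properties. The Row class, the toy data, and the seed values are hypothetical and exist only for this example.

```csharp
using System;
using System.Linq;
using Microsoft.ML;

// Hypothetical row type used only for this illustration.
public class Row
{
    public float Feature { get; set; }
    public float Label { get; set; }
}

public static class Program
{
    public static void Main()
    {
        var mlContext = new MLContext();

        // Build a small in-memory data set (made-up values).
        var rows = Enumerable.Range(0, 100)
            .Select(i => new Row { Feature = i, Label = 2f * i });
        IDataView data = mlContext.Data.LoadFromEnumerable(rows);

        // Each seed should give a different assignment of rows to train and test.
        foreach (int seed in new[] { 1, 2, 3 })
        {
            var split = mlContext.Data.TrainTestSplit(data, testFraction: 0.2, seed: seed);

            // Materialize the test set to inspect which rows landed in it.
            var testRows = mlContext.Data
                .CreateEnumerable<Row>(split.TestSet, reuseRowObject: false)
                .ToList();

            Console.WriteLine($"seed={seed}: {testRows.Count} test rows, " +
                              $"first feature = {testRows.FirstOrDefault()?.Feature}");
        }
    }
}
```

The test fraction is only approximate, so the number of test rows may vary slightly from one seed to another. Reusing a seed should reproduce the same split.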
Answered 2019-05-15T13:18:04.217