  • My dataset has 42,000 rows.
  • I need to split the data into training, cross-validation and test sets of 60%, 20% and 20% respectively. This follows Professor Andrew Ng's recommendation in his ml-class lectures.
  • I realize scikit-learn has a method, train_test_split, that can do this, but I can't get it to give me a 0.6 / 0.2 / 0.2 split in a one-liner command.

What I did is:

# split data into training, cv and test sets
from sklearn import cross_validation
train, intermediate_set = cross_validation.train_test_split(input_set, train_size=0.6, test_size=0.4)
cv, test = cross_validation.train_test_split(intermediate_set, train_size=0.5, test_size=0.5)


# preparing the training dataset
print 'training shape(Tuple of array dimensions) = ', train.shape
print 'training dimension(Number of array dimensions) = ', train.ndim
print 'cv shape(Tuple of array dimensions) = ', cv.shape
print 'cv dimension(Number of array dimensions) = ', cv.ndim
print 'test shape(Tuple of array dimensions) = ', test.shape
print 'test dimension(Number of array dimensions) = ', test.ndim

which gives me the result:

training shape(Tuple of array dimensions) =  (25200, 785)
training dimension(Number of array dimensions) =  2
cv shape(Tuple of array dimensions) =  (8400, 785)
cv dimension(Number of array dimensions) =  2
test shape(Tuple of array dimensions) =  (8400, 785)
test dimension(Number of array dimensions) =  2
features shape =  (25200, 784)
labels shape =  (25200,)

How can I do this in a single command?


1 Answer


Read the source code of train_test_split and its companion class ShuffleSplit, and adapt it to your use case. It's not a big function, so it shouldn't be complicated.
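
For illustration only (not part of the original answer), a minimal sketch of one such adaptation: a helper, here arbitrarily named train_cv_test_split, that shuffles the row indices and slices them into three blocks, which is essentially what ShuffleSplit does internally. The function name, the fixed seed and the default ratios are assumptions for the sketch.

import numpy as np

def train_cv_test_split(data, train_size=0.6, cv_size=0.2, seed=0):
    # Hypothetical helper, not from the original answer: shuffle the row
    # indices, then cut them into three contiguous blocks, mirroring what
    # ShuffleSplit does internally.
    rng = np.random.RandomState(seed)
    indices = rng.permutation(len(data))
    n_train = int(train_size * len(data))
    n_cv = int(cv_size * len(data))
    train = data[indices[:n_train]]
    cv = data[indices[n_train:n_train + n_cv]]
    test = data[indices[n_train + n_cv:]]
    return train, cv, test

# A single call then gives the 60/20/20 split:
# train, cv, test = train_cv_test_split(input_set)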

Answered 2012-11-12T16:04:37.630