1

正如标题所说,我想知道 sklearnGroupKFoldGroupShuffleSplit.

两者都为具有组 ID 的数据进行训练测试拆分,因此这些组不会在拆分中分离。我检查了每个函数的一个训练集/测试集,它们看起来都做了一个很好的分层,但如果有人能确认所有拆分都这样做,那就太好了。

我对两者进行了测试,分为 10 次:

gss = GroupShuffleSplit(n_splits=10, train_size=0.8, random_state=42)

 

for train_idx, test_idx in gss.split(X,y,groups):

    print("train:", train_idx, "test:", test_idx)

train: [ 1  2  3  4  5 11 12 13 14 15 16 17 19 20] test: [ 0  6  7  8  9 10 18]

train: [ 1  2  3  4  5  6  7  8  9 10 12 13 14 18 19 20] test: [ 0 11 15 16 17]

train: [ 0  1  3  4  5  6  7  8  9 10 12 13 14 18 19 20] test: [ 2 11 15 16 17]

train: [ 0  2  3  4 11 12 13 14 15 16 17 18 19 20] test: [ 1  5  6  7  8  9 10]

train: [ 0  1  3  4  5  6  7  8  9 10 11 15 16 17 19 20] test: [ 2 12 13 14 18]

train: [ 1  2  3  4  5  6  7  8  9 10 11 15 16 17 18] test: [ 0 12 13 14 19 20]

train: [ 0  1  2  3  4  6  7  8  9 10 11 12 13 14 15 16 17] test: [ 5 18 19 20]

train: [ 0  1  3  4  6  7  8  9 10 11 15 16 17 18 19 20] test: [ 2  5 12 13 14]

train: [ 0  1  3  4  5 12 13 14 15 16 17 18 19 20] test: [ 2  6  7  8  9 10 11]

train: [ 0  2  3  4  5 11 12 13 14 15 16 17 19 20] test: [ 1  6  7  8  9 10 18]

 

group_kfold = GroupKFold(n_splits=10)

 

for train_idx, test_idx in group_kfold.split(X,y,groups):

    print("train:", train_idx, "test:", test_idx)

train: [ 0  1  2  3  4  5 11 12 13 14 15 16 17 18 19 20] test: [ 6  7  8  9 10]

train: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 18 19 20] test: [15 16 17]

train: [ 0  1  2  3  4  5  6  7  8  9 10 11 15 16 17 18 19 20] test: [12 13 14]

train: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18] test: [19 20]

train: [ 0  1  2  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20] test: [3 4]

train: [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 19 20] test: [ 0 18]

train: [ 0  1  2  3  4  5  6  7  8  9 10 12 13 14 15 16 17 18 19 20] test: [11]

train: [ 0  1  2  3  4  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20] test: [5]

train: [ 0  1  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20] test: [2]

train: [ 0  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20] test: [1]
4

2 回答 2

3

un-Group 版本的文档更清楚地说明了这一点。KFold 拆分为 k 个折叠,然后将它们集中到不同的训练/测试拆分中,而 ShuffleSplit 重复直接进行训练/测试拆分。特别是,每个样本在 KFold 中只测试一次,但可以在 ShuffleSplit 中测试零次或多次。

于 2020-08-21T00:47:26.267 回答
0

scikit learn here的文档很好地概述了这两种方法的工作原理。这里有一个插图,可以简化对同一文档的理解。

分层洗牌拆分GroupKFold

于 2021-01-02T11:50:02.023 回答