python - Split data into train/test files so that both files contain at least one sample of each category

Question

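I have a CSV file that is read into a DataFrame. I split it into train and test files based on the values of one column.

Say the column is called "category" and it holds several category names as values, e.g. cat1, cat2, cat3, and so on, each repeated many times.

I need to split the file so that every category name appears at least once in both files.

So far I am able to split the file into two parts by ratio. I have tried many options, but this is the best one so far.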

  def executeSplitData(self):
      data = self.readCSV()
      df = data
      if self.column in data:
          train, test = train_test_split(df, stratify=None, test_size=0.5)
          self.writeTrainFile(train)
          self.writeTestFile(test)

I don't fully understand the stratify option in train_test_split. Please help. Thanks.

score 3 · Accepted Answer

I tried to use it as described in the documentation, but I could not get stratify to work.

Setup

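# Note: sklearn.cross_validation was removed in scikit-learn 0.20;
# newer releases provide train_test_split in sklearn.model_selection.
from sklearn.cross_validation import train_test_split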
import pandas as pd
import numpy as np

np.random.seed([3,1415])
p = np.arange(1, 5.) / np.arange(1, 5.).sum()
df = pd.DataFrame({'category': np.random.choice(('cat1', 'cat2', 'cat3', 'cat4'), (1000,), p=p),
                   'x': np.random.rand(1000), 'y': np.random.choice(range(2), (1000,))})


def get_freq(s):
    return s.value_counts() / len(s)

print(get_freq(df.category))

cat4    0.400
cat3    0.284
cat2    0.208
cat1    0.108
Name: category, dtype: float64

If I try any of these:

train, test = train_test_split(df, stratify=df.category, test_size=.5)
train, test = train_test_split(df, stratify=df.category.values, test_size=.5)
train, test = train_test_split(df, stratify=df.category.values.tolist(), test_size=.5)

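they all raise:

TypeError: Invalid parameters passed:

The documentation says:

stratify : array-like or None (default is None)

I can't think of a reason why this would not work.

So I decided to build a workaround that splits each category separately and concatenates the pieces: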
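
For comparison, scikit-learn releases that ship `sklearn.model_selection` (0.18 and later) accept `stratify` in `train_test_split`. The TypeError above is consistent with an older release that predates the argument, though that is an assumption on my part. Below is a minimal sketch of the stratified call, assuming a newer release is installed; the `demo` frame, its proportions, and the `random_state` values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: four categories with unequal frequencies.
rng = np.random.RandomState(0)
demo = pd.DataFrame({
    'category': rng.choice(['cat1', 'cat2', 'cat3', 'cat4'],
                           size=1000, p=[0.1, 0.2, 0.3, 0.4]),
    'x': rng.rand(1000),
})

# stratify takes the labels whose proportions should be preserved
# in both halves of the split.
train_demo, test_demo = train_test_split(
    demo, stratify=demo['category'], test_size=0.5, random_state=0)

print(train_demo['category'].value_counts(normalize=True))
print(test_demo['category'].value_counts(normalize=True))
```

With stratification, each category keeps roughly the same share in both halves, so every category with at least two rows ends up in both files.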

def stratify_train_test(df, stratifyby, *args, **kwargs):
    train, test = pd.DataFrame(), pd.DataFrame()
    gb = df.groupby(stratifyby)
    for k in gb.groups:
        traink, testk = train_test_split(gb.get_group(k), *args, **kwargs)
        train = pd.concat([train, traink])
        test = pd.concat([test, testk])
    return train, test

train, test = stratify_train_test(df, 'category', test_size=.5)
# this also works
# train, test = stratify_train_test(df, df.category, test_size=.5)

print(get_freq(train.category))
print(len(train))

cat4    0.400
cat3    0.284
cat2    0.208
cat1    0.108
Name: category, dtype: float64
500

print(get_freq(test.category))
print(len(test))

cat4    0.400
cat3    0.284
cat2    0.208
cat1    0.108
Name: category, dtype: float64
500
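
The original requirement was that every category shows up at least once in both files. Matching frequencies suggests this holds, but it can also be checked directly. Here is a small sketch of such a check; `missing_categories` is a hypothetical helper, not part of pandas or scikit-learn, and the label lists are made-up sample data.

```python
def missing_categories(all_labels, train_labels, test_labels):
    """Return the categories absent from the train split and from the test split."""
    expected = set(all_labels)
    return expected - set(train_labels), expected - set(test_labels)


# Tiny made-up example; with the DataFrames above one would pass
# df.category, train.category and test.category instead.
labels = ['cat1', 'cat1', 'cat2', 'cat2', 'cat3', 'cat3']
train_labels = ['cat1', 'cat2', 'cat3']
test_labels = ['cat1', 'cat2']

missing_train, missing_test = missing_categories(labels, train_labels, test_labels)
print(missing_train)  # set()
print(missing_test)   # {'cat3'}
```

Two empty sets mean the split satisfies the requirement. A category with only one row can never appear in both files, whichever splitting method is used.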
