10

I am trying to split my dataset into a training set and a test set using the train_test_split function from scikit-learn, but I get this error:

In [1]: y.iloc[:,0].value_counts()
Out[1]: 
M2    38
M1    35
M4    29
M5    15
M0    15
M3    15

In [2]: xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=1/3, random_state=85, stratify=y)
Out[2]: 
Traceback (most recent call last):
  File "run_ok.py", line 48, in <module>
    xtrain,xtest,ytrain,ytest = train_test_split(X,y,test_size=1/3,random_state=85,stratify=y)
  File "/home/aurora/.pyenv/versions/3.6.0/lib/python3.6/site-packages/sklearn/model_selection/_split.py", line 1700, in train_test_split
    train, test = next(cv.split(X=arrays[0], y=stratify))
  File "/home/aurora/.pyenv/versions/3.6.0/lib/python3.6/site-packages/sklearn/model_selection/_split.py", line 953, in split
    for train, test in self._iter_indices(X, y, groups):
  File "/home/aurora/.pyenv/versions/3.6.0/lib/python3.6/site-packages/sklearn/model_selection/_split.py", line 1259, in _iter_indices
    raise ValueError("The least populated class in y has only 1"
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

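However, all classes have at least 15 samples. Why am I getting this error?

X is a pandas DataFrame representing the data points, and y is a pandas DataFrame with a single column containing the target variable.

I cannot post the original data because it is proprietary, but the problem is fairly easy to reproduce: create a random pandas DataFrame (X) with 1k rows x 500 columns, and a random pandas DataFrame (y) with the same number of rows (1k) as X that holds, for each row, the target variable (a categorical label). The y DataFrame should contain several distinct categorical labels (e.g. 'class1', 'class2', ...), and each label should occur at least 15 times.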
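The setup described above could be built roughly as follows. This is only a sketch: the seed, the column name `target`, and the choice of six classes are assumptions made for illustration, not details from the original data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, chosen only for repeatability

n_rows, n_cols, n_classes = 1000, 500, 6  # n_classes is an assumption

# Random feature matrix: 1k rows x 500 columns
X = pd.DataFrame(rng.normal(size=(n_rows, n_cols)))

# Cycle through the labels so every class gets roughly n_rows / n_classes rows,
# then shuffle so the labels are not ordered
labels = np.array([f"class{i + 1}" for i in range(n_classes)])
y_values = rng.permutation(labels[np.arange(n_rows) % n_classes])

# y is a one-column DataFrame, as in the question ("target" is a hypothetical name)
y = pd.DataFrame({"target": y_values})

print(y.iloc[:, 0].value_counts())
```

Whether the error actually reproduces with this data may depend on the installed scikit-learn version.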

4

9 Answers

7

The problem was that train_test_split takes 2 arrays as input, but the y array is a one-column matrix. If I pass only the first column of y, it works:

xtrain, xtest, ytrain, ytest = train_test_split(X, y.iloc[:,0], test_size=1/3,
  random_state=85, stratify=y.iloc[:,0])
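To make the difference between the one-column DataFrame and its first column concrete, here is a small sketch that rebuilds labels with the class counts shown in the question and splits on the 1-D column. The feature matrix is random and the column name `target` is hypothetical; only the class counts, `test_size`, and `random_state` come from the question.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Class counts copied from the question's value_counts() output
counts = {"M2": 38, "M1": 35, "M4": 29, "M5": 15, "M0": 15, "M3": 15}

# One-column DataFrame of labels ("target" is a hypothetical column name)
y = pd.DataFrame({"target": np.repeat(list(counts), list(counts.values()))})

# Stand-in features: random numbers, 5 columns (an assumption for illustration)
X = pd.DataFrame(np.random.default_rng(0).normal(size=(len(y), 5)))

print(y.shape, y.iloc[:, 0].shape)  # 2-D (147, 1) versus 1-D (147,)

# Pass the first column (a 1-D Series) both as the target and as stratify
xtrain, xtest, ytrain, ytest = train_test_split(
    X, y.iloc[:, 0], test_size=1/3, random_state=85, stratify=y.iloc[:, 0]
)

# Each class should show up in the test set in roughly its original proportion
print(ytest.value_counts())
```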
answered 2017-04-03T09:36:11.503
3

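The main point is that with stratified CV you get this warning whenever the number of splits cannot give every CV split the same class ratios as the full data. For example, with 5 splits and a class that has only 2 samples, 2 of the CV sets contain a sample of this class and the other 3 contain none, so the ratio of this class is not equal across the CV sets. The problem only arises when some set ends up with 0 samples, so if every class has at least as many samples as there are CV splits (5 in this case), the warning does not appear.

See https://stackoverflow.com/a/48314533/2340939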
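The 5-split example can be checked with a short sketch using StratifiedKFold. The helper `count_rare_class_warnings` and the toy label arrays are hypothetical, introduced only for illustration; the exact warning text may differ between scikit-learn versions, so the helper matches on a fragment of the message.

```python
import warnings

import numpy as np
from sklearn.model_selection import StratifiedKFold


def count_rare_class_warnings(y, n_splits=5):
    """Run a stratified K-fold split and count 'least populated class' warnings."""
    X = np.zeros((len(y), 1))  # features are irrelevant for this check
    skf = StratifiedKFold(n_splits=n_splits)
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        list(skf.split(X, y))  # force the splitter to actually run
    return sum("least populated" in str(w.message) for w in caught)


# Class "b" has 2 samples, fewer than the 5 splits
y_rare = np.array(["a"] * 10 + ["b"] * 2)
# Class "b" has 5 samples, as many as the number of splits
y_ok = np.array(["a"] * 10 + ["b"] * 5)

print(count_rare_class_warnings(y_rare))
print(count_rare_class_warnings(y_ok))
```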

answered 2020-06-25T20:02:57.113
1

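Continuing from user2340939's answer: if you really need the train-test split to be stratified even though some class has only a few rows, you can try the following approach. I generally use it myself; all rows of such classes are copied into both the training dataset and the test dataset.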

from sklearn.model_selection import train_test_split

def get_min_required_rows(test_size=0.2):
    return 1 / test_size

def make_stratified_splits(df, y_col="label", test_size=0.2):
    """
        for any class with rows less than min_required_rows corresponding to the input test_size,
        all the rows associated with the specific class will have a copy in both the train and test splits.
        
        example: if test_size is 0.2 (20% otherwise),
        min_required_rows = 5 (which is obtained from 1 / test_size i.e., 1 / 0.2)
        where the resulting splits will have 4 train rows (80%), 1 test row (20%)..
    """
    
    id_col = "id"
    temp_col = "same-class-rows"
    
    class_to_counts = df[y_col].value_counts()
    df[temp_col] = df[y_col].apply(lambda y: class_to_counts[y])
    
    min_required_rows = get_min_required_rows(test_size)
    copy_rows = df[df[temp_col] < min_required_rows].copy(deep=True)
    valid_rows = df[df[temp_col] >= min_required_rows].copy(deep=True)
    
    X = valid_rows[id_col].tolist()
    y = valid_rows[y_col].tolist()
    
    # notice, this train_test_split is a stratified split
    X_train, X_test, _, _ = train_test_split(X, y, test_size=test_size, random_state=43, stratify=y)
    
    X_test = X_test + copy_rows[id_col].tolist()
    X_train = X_train + copy_rows[id_col].tolist()
    
    df.drop([temp_col], axis=1, inplace=True)
    
    test_df = df[df[id_col].isin(X_test)].copy(deep=True)
    train_df = df[df[id_col].isin(X_train)].copy(deep=True)
    
    print (f"number of rows in the original dataset: {len(df)}")
    
    test_prop = round(len(test_df) / len(df) * 100, 2)
    train_prop = round(len(train_df) / len(df) * 100, 2)
    print (f"number of rows in the splits: {len(train_df)} ({train_prop}%), {len(test_df)} ({test_prop}%)")
    
    return train_df, test_df
answered 2021-05-11T12:19:03.920
0
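import pandas as pd
from sklearn.model_selection import train_test_split

all_keys = df['Key'].unique().tolist()

# DataFrame.append was removed in pandas 2.0, so collect the pieces
# in lists and concatenate them once at the end.
t_parts = []  # first (train) output of each split
c_parts = []  # second (test) output of each split

for key in all_keys:
    print(key)
    group = df.loc[df['Key'] == key]
    if group.shape[0] < 2:
        # A class with a single row cannot be split; keep it in t_df only
        t_parts.append(group)
    else:
        df_t, df_c = train_test_split(group, test_size=0.2, stratify=group['Key'])
        t_parts.append(df_t)
        c_parts.append(df_c)

t_df = pd.concat(t_parts) if t_parts else pd.DataFrame()
c_df = pd.concat(c_parts) if c_parts else pd.DataFrame()
answered 2022-02-10T15:22:04.293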
0

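I ran into this problem because some of the things I was splitting were lists and some were arrays. When I converted the arrays to lists, it worked.

answered 2021-09-06T23:27:44.280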
0

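I had the same problem. Some of my classes had only one or two items (mine is a multi-class problem). You can drop the classes that have few items or merge them together. That is how I solved my problem.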
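One way to merge the small classes is sketched below. The helper `merge_rare_classes`, the threshold `min_count`, and the label `"other"` are all hypothetical choices for illustration, not something taken from this answer.

```python
import pandas as pd


def merge_rare_classes(labels, min_count=2, other_label="other"):
    """Replace every class with fewer than min_count rows by a single merged label."""
    counts = labels.value_counts()
    rare = counts[counts < min_count].index
    # Keep labels that are not rare; overwrite the rare ones with other_label
    return labels.where(~labels.isin(rare), other_label)


labels = pd.Series(["a"] * 5 + ["b"] * 4 + ["c"] + ["d"])
merged = merge_rare_classes(labels, min_count=2)
print(merged.value_counts())
```

Note that the merged class can itself still be too small (for example when only one rare row exists), in which case dropping those rows may be the simpler option.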

answered 2021-11-23T13:29:00.967
-1

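Remove the stratification.

stratify=y

should only be used for classification problems, so that the output classes (say "good" and "bad") appear in the same proportions in the training and test data. It is a sampling method from statistics. Stratification should be avoided in regression problems. The code below should work:

xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=1/3, random_state=85)
answered 2021-08-16T19:04:45.850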
-1

Remove stratify=y when splitting the data into training and test sets:

xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=1/3, random_state=85)
answered 2020-04-03T23:42:31.697
-2

Try it this way; it worked for me, and it is also mentioned here:

x_train, x_test, y_train, y_test = train_test_split(data_x, data_y, test_size=0.33, random_state=42)
answered 2020-02-14T23:53:57.803