python - scikit-learn error: The least populated class in y has only 1 member

Question

I am trying to split my dataset into a training set and a test set using the train_test_split function from scikit-learn, but I get this error:

In [1]: y.iloc[:,0].value_counts()
Out[1]: 
M2    38
M1    35
M4    29
M5    15
M0    15
M3    15

In [2]: xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=1/3, random_state=85, stratify=y)
Out[2]: 
Traceback (most recent call last):
  File "run_ok.py", line 48, in <module>
    xtrain,xtest,ytrain,ytest = train_test_split(X,y,test_size=1/3,random_state=85,stratify=y)
  File "/home/aurora/.pyenv/versions/3.6.0/lib/python3.6/site-packages/sklearn/model_selection/_split.py", line 1700, in train_test_split
    train, test = next(cv.split(X=arrays[0], y=stratify))
  File "/home/aurora/.pyenv/versions/3.6.0/lib/python3.6/site-packages/sklearn/model_selection/_split.py", line 953, in split
    for train, test in self._iter_indices(X, y, groups):
  File "/home/aurora/.pyenv/versions/3.6.0/lib/python3.6/site-packages/sklearn/model_selection/_split.py", line 1259, in _iter_indices
    raise ValueError("The least populated class in y has only 1"
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

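However, all classes have at least 15 samples. Why am I getting this error?

X is a pandas DataFrame representing the data points, and y is a pandas DataFrame with a single column containing the target variable.

I can't post the original data because it is proprietary, but the setup should be fairly easy to reproduce: create a random pandas DataFrame (X) with 1k rows x 500 columns and a random pandas DataFrame (y) with the same number of rows (1k) as X, holding the target variable (a categorical label) for each row. The y DataFrame should contain several different categorical labels (e.g. 'class1', 'class2', ...), and each label should occur at least 15 times.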

score 7 · Accepted Answer

The problem was that train_test_split takes 2 arrays as input, but the y array is a one-column matrix. If I pass only the first column of y, it works.

xtrain, xtest, ytrain, ytest = train_test_split(X, y.iloc[:,0], test_size=1/3,
  random_state=85, stratify=y.iloc[:,0])
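The same idea can be sketched on synthetic data. Everything about the data here is an assumption made for illustration (three classes, four random features, a column named `target`); only the `train_test_split` call mirrors the answer. The label column is selected as a one-dimensional Series and used both as the labels and as the `stratify` argument.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the proprietary data (assumed sizes and names).
rng = np.random.default_rng(0)
labels = np.repeat(["M0", "M1", "M2"], [15, 20, 25])
X = pd.DataFrame(rng.normal(size=(len(labels), 4)))
y = pd.DataFrame({"target": labels})  # single-column DataFrame, as in the question

# Select the label column as a 1-D Series and inspect the class sizes.
target = y.iloc[:, 0]
print(target.value_counts())

xtrain, xtest, ytrain, ytest = train_test_split(
    X, target, test_size=1/3, random_state=85, stratify=target
)

# Each class should appear in both splits in roughly the original proportions.
print(ytrain.value_counts())
print(ytest.value_counts())
```

Printing `value_counts()` on exactly the object passed to `stratify` is a quick way to see which classes scikit-learn will be working with.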

score 3 · Accepted Answer

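The main point is that with stratified CV you get this warning whenever the number of splits cannot produce CV sets that all have the same class ratios as the data. For example, if a class has only 2 samples and you use 5 splits, two of the CV sets will contain a sample of this class and the other three will contain none, so the class ratio is not the same in every CV set. The problem only arises when some set ends up with 0 samples of a class, so if every class has at least as many samples as there are CV splits (5 in this case), the warning does not appear.

See https://stackoverflow.com/a/48314533/2340939.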
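The 5-split example can be checked with a small sketch. The toy labels ("common" with 20 samples, "rare" with 2) and the zero-valued feature matrix are assumptions chosen only to show how a 2-sample class is distributed over the folds of `StratifiedKFold`.

```python
import warnings

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels: one well-populated class and one class with only 2 samples.
y = np.array(["common"] * 20 + ["rare"] * 2)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=5)

# Record warnings instead of printing them, so they can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    folds = list(skf.split(X, y))

# Count how many "rare" samples land in each test fold.
rare_per_fold = [int((y[test_idx] == "rare").sum()) for _, test_idx in folds]
print(rare_per_fold)
print([str(w.message) for w in caught])
```

With only 2 samples available, at least three of the five test folds cannot contain the rare class, which is the situation the warning reports.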

score 1 · Accepted Answer

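Continuing from user2340939's answer: if you really need the train-test split to be stratified even though some class has very few rows, you can try the approach below. I generally use it myself: all rows of such a class are copied into both the training dataset and the test dataset.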

from sklearn.model_selection import train_test_split

def get_min_required_rows(test_size=0.2):
    return 1 / test_size

def make_stratified_splits(df, y_col="label", test_size=0.2):
    """
    Stratified train/test split that tolerates very small classes.

    Any class with fewer rows than min_required_rows (derived from test_size)
    gets all of its rows copied into both the train and the test split.

    Example: if test_size is 0.2 (i.e. 20%), min_required_rows = 5
    (1 / test_size = 1 / 0.2), so a class with 5 rows yields
    4 train rows (80%) and 1 test row (20%).

    Expects df to have an "id" column that uniquely identifies each row.
    """
    
    id_col = "id"
    temp_col = "same-class-rows"
    
    class_to_counts = df[y_col].value_counts()
    df[temp_col] = df[y_col].apply(lambda y: class_to_counts[y])
    
    min_required_rows = get_min_required_rows(test_size)
    copy_rows = df[df[temp_col] < min_required_rows].copy(deep=True)
    valid_rows = df[df[temp_col] >= min_required_rows].copy(deep=True)
    
    X = valid_rows[id_col].tolist()
    y = valid_rows[y_col].tolist()
    
    # notice, this train_test_split is a stratified split
    X_train, X_test, _, _ = train_test_split(X, y, test_size=test_size, random_state=43, stratify=y)
    
    X_test = X_test + copy_rows[id_col].tolist()
    X_train = X_train + copy_rows[id_col].tolist()
    
    df.drop([temp_col], axis=1, inplace=True)
    
    test_df = df[df[id_col].isin(X_test)].copy(deep=True)
    train_df = df[df[id_col].isin(X_train)].copy(deep=True)
    
    print(f"number of rows in the original dataset: {len(df)}")
    
    test_prop = round(len(test_df) / len(df) * 100, 2)
    train_prop = round(len(train_df) / len(df) * 100, 2)
    print(f"number of rows in the splits: {len(train_df)} ({train_prop}%), {len(test_df)} ({test_prop}%)")
    
    return train_df, test_df

score 0 · Accepted Answer

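import pandas as pd
from sklearn.model_selection import train_test_split

# Split each class separately; df is assumed to have a 'Key' column with the class label.
all_keys = df['Key'].unique().tolist()

train_parts = []
test_parts = []

for key in all_keys:
    print(key)
    group = df.loc[df['Key'] == key]
    if group.shape[0] < 2:
        # A class with a single row cannot be split, so it goes to the training set only.
        train_parts.append(group)
    else:
        df_t, df_c = train_test_split(group, test_size=0.2, stratify=group['Key'])
        train_parts.append(df_t)
        test_parts.append(df_c)

# DataFrame.append was removed in pandas 2.0, so the parts are combined with pd.concat.
t_df = pd.concat(train_parts)
c_df = pd.concat(test_parts)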

score 0 · Accepted Answer

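I ran into this problem because some of the things I was splitting were lists and some were arrays. When I converted the arrays to lists, it worked.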

score 0 · Accepted Answer

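I had the same problem. Some classes had only one or two items. (Mine is a multi-class problem.) You can remove the classes that have few items, or merge them. That is how I solved my problem.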
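One way to merge small classes is sketched below. The helper name `merge_rare_classes`, the `min_count` threshold and the `"other"` label are hypothetical choices for illustration, not something prescribed by the answer.

```python
import pandas as pd

def merge_rare_classes(labels, min_count=2, other_label="other"):
    """Replace every class with fewer than min_count rows by other_label."""
    counts = labels.value_counts()
    rare = counts[counts < min_count].index
    # Keep labels that are not rare; overwrite the rare ones.
    return labels.where(~labels.isin(rare), other_label)

labels = pd.Series(["a"] * 5 + ["b"] * 3 + ["c"] + ["d"])
merged = merge_rare_classes(labels, min_count=2)
print(merged.value_counts())
```

Note that if only a single row ends up in the merged class, it is still too small to stratify on, so it is worth re-checking `value_counts()` after merging.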

score -1 · Accepted Answer

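Remove the stratification.

stratify=y

should only be used for classification problems, so that the output classes (say 'good' and 'bad') are represented in the same proportions in the training and test data. It is a sampling method from statistics. We should avoid stratifying in regression problems. The code below should work: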

xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=1/3, random_state=85)

score -1 · Accepted Answer

Remove stratify=y when splitting the data into training and test sets:

xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=1/3, random_state=85)

score -2 · Accepted Answer

Try it this way. It worked for me, and it is also mentioned here:

x_train, x_test, y_train, y_test = train_test_split(data_x, data_y, test_size=0.33, random_state=42)
