train_test_split
我正在尝试使用scikit-learn中的函数将我的数据集拆分为训练集和测试集,但出现此错误:
In [1]: y.iloc[:,0].value_counts()
Out[1]:
M2 38
M1 35
M4 29
M5 15
M0 15
M3 15
In [2]: xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=1/3, random_state=85, stratify=y)
Out[2]:
Traceback (most recent call last):
File "run_ok.py", line 48, in <module>
xtrain,xtest,ytrain,ytest = train_test_split(X,y,test_size=1/3,random_state=85,stratify=y)
File "/home/aurora/.pyenv/versions/3.6.0/lib/python3.6/site-packages/sklearn/model_selection/_split.py", line 1700, in train_test_split
train, test = next(cv.split(X=arrays[0], y=stratify))
File "/home/aurora/.pyenv/versions/3.6.0/lib/python3.6/site-packages/sklearn/model_selection/_split.py", line 953, in split
for train, test in self._iter_indices(X, y, groups):
File "/home/aurora/.pyenv/versions/3.6.0/lib/python3.6/site-packages/sklearn/model_selection/_split.py", line 1259, in _iter_indices
raise ValueError("The least populated class in y has only 1"
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
但是,所有类都至少有 15 个样本。为什么我会收到此错误?
X 是一个表示数据点的 pandas DataFrame,y 是一个 pandas DataFrame,其中一列包含目标变量。
我不能发布原始数据,因为它是专有的,但是通过创建一个具有 1k 行 x 500 列的随机 pandas DataFrame (X) 和一个具有相同行数 (1k) X 的随机 pandas DataFrame (y) 是相当可复制的,并且,对于每一行,目标变量(分类标签)。y pandas DataFrame 应该有不同的分类标签(例如 'class1'、'class2'...),并且每个标签应该至少出现 15 次。