ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

This is the error I get from the following code:

import pandas as pd
from xgboost import XGBClassifier
from sklearn.linear_model import (LogisticRegression, RidgeClassifier, SGDClassifier,
                                  PassiveAggressiveClassifier)
from sklearn.svm import SVC, LinearSVC, NuSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB, ComplementNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier,
                              BaggingClassifier, ExtraTreesClassifier)
from sklearn.model_selection import StratifiedShuffleSplit, cross_validate

# List of machine learning algorithms that will be used for predictions
estimator = [('Logistic Regression', LogisticRegression), ('Ridge Classifier', RidgeClassifier), 
             ('SGD Classifier', SGDClassifier), ('Passive Aggressive Classifier', PassiveAggressiveClassifier), 
             ('SVC', SVC), ('Linear SVC', LinearSVC), ('Nu SVC', NuSVC), 
             ('K-Neighbors Classifier', KNeighborsClassifier),
             ('Gaussian Naive Bayes', GaussianNB), ('Multinomial Naive Bayes', MultinomialNB), 
             ('Bernoulli Naive Bayes', BernoulliNB), ('Complement Naive Bayes', ComplementNB), 
             ('Decision Tree Classifier', DecisionTreeClassifier), 
             ('Random Forest Classifier', RandomForestClassifier), ('AdaBoost Classifier', AdaBoostClassifier), 
             ('Gradient Boosting Classifier', GradientBoostingClassifier), ('Bagging Classifier', BaggingClassifier), 
             ('Extra Trees Classifier', ExtraTreesClassifier), ('XGBoost', XGBClassifier)]

# Separating independent features and dependent feature from the dataset
#X_train = titanic.drop(columns='Survived')
#y_train = titanic['Survived']

# Creating a dataframe to compare the performance of the machine learning models
comparison_cols = ['Algorithm', 'Training Time (Avg)', 'Accuracy (Avg)', 'Accuracy (3xSTD)']
comparison_df = pd.DataFrame(columns=comparison_cols)

# Generating training/validation dataset splits for cross validation
cv_split = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=0)

# Performing cross-validation to estimate the performance of the models
for idx, est in enumerate(estimator):

    cv_results = cross_validate(est[1](), X, y, cv=cv_split)

    comparison_df.loc[idx, 'Algorithm'] = est[0]
    comparison_df.loc[idx, 'Training Time (Avg)'] = cv_results['fit_time'].mean()
    comparison_df.loc[idx, 'Accuracy (Avg)'] = cv_results['test_score'].mean()
    comparison_df.loc[idx, 'Accuracy (3xSTD)'] = cv_results['test_score'].std() * 3

comparison_df.set_index(keys='Algorithm', inplace=True)
comparison_df.sort_values(by='Accuracy (Avg)', ascending=False, inplace=True)

I guess the cv_split part is what is giving me the problem.
I found a workaround that uses train_test_split, but it does not return splits the way cv_split does.
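Roughly, that workaround looked like this (just a sketch; the test_size and random_state values are assumptions, not the original kernel code):

from sklearn.model_selection import train_test_split

# A single, non-stratified hold-out split; unlike cv_split, this gives one
# train/validation pair instead of the 10 resampled splits that cross_validate loops over.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)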

The strange thing is that I have used this exact code on another Kaggle problem without any issue,
so I compared the shapes of X and y between the two Kaggle kernels.

The Kaggle kernel that works fine:
print(X.shape)
print(y.shape)
(891, 9)
(891,)
array([0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1.....])

================================================================

The Kaggle kernel that has the problem (error):
print(X.shape)
print(y.shape)
(15035, 24)
(15035,)
array([221900., 180000., 510000., ..., 360000., 400000., 325000.])

The shapes of the two kernels look the same to me, and I cannot tell what the
difference between X and y in these two kernels is.
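One check that goes beyond the shape would be counting how many samples each distinct value of y has; a minimal sketch of that check (using the same y as above) would be:

import numpy as np

# How many samples fall into each distinct value of y?
values, counts = np.unique(y, return_counts=True)
print(len(values))    # number of distinct classes the stratifier sees
print(counts.min())   # size of the least populated class; the error says this must be >= 2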

Does anyone know why this error occurs?


2 Answers


I got a similar error when using train_test_split. It was because I had passed stratify=data instead of stratify=target.
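In other words, the stratify argument has to receive the label array, not the feature matrix. A minimal sketch (data and target are placeholder names for the features and labels):

from sklearn.model_selection import train_test_split

# The mistake described above: the features were passed to stratify
# X_train, X_test, y_train, y_test = train_test_split(data, target, stratify=data)

# Stratify on the labels instead
X_train, X_test, y_train, y_test = train_test_split(
    data, target, test_size=0.3, stratify=target, random_state=0)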

Answered 2020-03-16T18:13:23.883

Are you picking up index values? Not sure, though. You could try StratifiedKFold; the following worked for me:

from sklearn.model_selection import StratifiedKFold, cross_val_score

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)  # shuffle=True so that random_state takes effect
results = cross_val_score(model, X_train, y_train, cv=kfold)

Answered 2019-04-02T06:28:09.223