python - How do I apply SMOTENC to a dataframe containing object and numeric columns?

Question

> In: data.dtypes

Out: Organization Name                                 object
Money Raised Currency (in USD)                   float64
Announced Date                            datetime64[ns]
Total Funding Amount Currency (in USD)           float64
Organization Description                          object
Organization Location                             object
Raised Series A                                    int64
Primary Industry                                  object
Sub_Ind                                           object
Sub_Ind2                                          object
Sub_Ind3                                          object
Sub_Ind4                                          object
Sub_Ind5                                          object
Sub_Ind6                                          object
Sub_Ind7                                          object
Investor1                                         object
Investor2                                         object
Investor3                                         object
Investor4                                         object
Investor5                                         object
Investor6                                         object
Investor7                                         object
Investor8                                         object
Investor9                                         object
Investor10                                        object
Investor11                                        object

> In: x = data.drop(columns=['Raised Series A', 'Announced Date'])

> In: y = data['Raised Series A']

> In: from imblearn.over_sampling import SMOTENC

> In: smote_nc = SMOTENC(categorical_features=[0,1,3,4,5,7,8,9,10,11,12,13,14,15,16,17,
18,19,20,21,22,23,24], random_state=0)

> In: x_resampled, y_resampled = smote_nc.fit_resample(x, y)

  ---------------------------------------------------------------------------
Out: ValueError                                Traceback (most recent call last)
 in 
----> 1 x_resampled, y_resampled = smote_nc.fit_resample(x, y)

~/opt/anaconda3/envs/unit2/lib/python3.7/site-packages/imblearn/base.py in fit_resample(self, X, y)
     81         )
     82 
---> 83         output = self._fit_resample(X, y)
     84 
     85         y_ = (label_binarize(output[1], np.unique(y))

~/opt/anaconda3/envs/unit2/lib/python3.7/site-packages/imblearn/over_sampling/_smote.py in _fit_resample(self, X, y)
    936     def _fit_resample(self, X, y):
    937         self.n_features_ = X.shape[1]
--> 938         self._validate_estimator()
    939 
    940         # compute the median of the standard deviation of the minority class

~/opt/anaconda3/envs/unit2/lib/python3.7/site-packages/imblearn/over_sampling/_smote.py in _validate_estimator(self)
    921                 raise ValueError(
    922                     "Some of the categorical indices are out of range. Indices"
--> 923                     " should be between 0 and {}".format(self.n_features_)
    924                 )
    925             self.categorical_features_ = categorical_features

ValueError: Some of the categorical indices are out of range. Indices should be between 0 and 24
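
The error can be narrowed down without imbalanced-learn. The message reports 24 features, which matches the 26 columns in `data.dtypes` minus the two dropped ones. A minimal sketch, assuming positions are zero-based (so valid indices run from 0 to 23) and that the list below is the one passed to `categorical_features` above:

```python
# The index list from the question, written out in full.
categorical_features = [0, 1, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
                        18, 19, 20, 21, 22, 23, 24]

# Number of columns in x, as reported in the error message.
n_features = 24

# Assumption: valid positions are 0 .. n_features - 1.
out_of_range = [index for index in categorical_features
                if not 0 <= index < n_features]
print(out_of_range)  # [24]
```

Under that assumption, the only offending entry is 24, which suggests the list was built against a different column layout than `x`.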

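I have been trying different combinations of columns in the categorical_features argument, but none of them work. My dataframe also has no null values. The reason I am using SMOTENC is that my target vector is highly skewed: 99.7% is yes, 0.3% is no. Please help.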
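
One way to avoid hand-typing positions is to derive them from the column dtypes. The following is a sketch, not a confirmed fix: `categorical_indices`, `resample_with_smotenc`, and the toy frame are hypothetical names and values introduced for illustration, and it assumes every non-numeric column in `x` should be treated as categorical.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def categorical_indices(frame):
    """Return the zero-based positions of the non-numeric columns of `frame`."""
    return [position for position, dtype in enumerate(frame.dtypes)
            if not is_numeric_dtype(dtype)]


def resample_with_smotenc(x, y, random_state=0):
    """Oversample with SMOTENC, treating non-numeric columns as categorical."""
    # Imported lazily so the helper above works without imbalanced-learn installed.
    from imblearn.over_sampling import SMOTENC

    smote_nc = SMOTENC(categorical_features=categorical_indices(x),
                       random_state=random_state)
    return smote_nc.fit_resample(x, y)


# Toy frame with made-up values, mirroring a few of the question's columns.
toy = pd.DataFrame({
    "Organization Name": ["a", "b", "c"],
    "Money Raised Currency (in USD)": [1.0, 2.0, 3.0],
    "Total Funding Amount Currency (in USD)": [4.0, 5.0, 6.0],
    "Primary Industry": ["x", "y", "x"],
})
toy_indices = categorical_indices(toy)
print(toy_indices)  # [0, 3]
```

If `x` keeps the column order shown in `data.dtypes`, the same helper would give 0 and 3 through 23 for the question's data, leaving positions 1 and 2 (the two float64 columns) as numeric.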
