
I am trying to use SMOTENC from the imbalanced-learn library to oversample a dataframe that contains both categorical and numerical variables. There are 55 columns in total, 3 of which are numerical. The number of samples per class (value counts) in the dataset is:

ID   # of samples
2    281
0    184
6     57
4     27
3      5
7      3

I am attempting to oversample this dataset with the code:

from imblearn.over_sampling import SMOTENC

sm = SMOTENC(categorical_features=cat_features, random_state=42, k_neighbors=1)
x_res, y_res = sm.fit_resample(x, y)

Here cat_features contains the indexes of the categorical columns, y contains the class label of each sample, and x contains the remaining features. The call fails with ValueError: could not broadcast input array from the shape (3,96) into shape (184,96). As far as I can tell, the error is related to the class with ID 7, which has only 3 samples. Why can't SMOTENC oversample this class? Is there a minimum number of samples per class required for oversampling? Also, my data does not have 96 columns, so where does that number come from?
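The shapes in the message can be reproduced with plain NumPy. The sketch below uses made-up arrays and assumes (I have not confirmed this in the library source) that a block built from the 3 samples of class 7 is being written into a slot sized for the 184 samples of class 0:

```python
import numpy as np

# Hypothetical stand-ins: 184 rows (size of class 0) and 3 rows (size of class 7),
# both with 96 columns as reported in the error message.
target = np.zeros((184, 96))
minority_block = np.ones((3, 96))

try:
    # A (3, 96) block cannot be broadcast into a (184, 96) slice.
    target[:, :] = minority_block
    message = ""
except ValueError as exc:
    message = str(exc)

print(message)
```

If this is what happens, the failure would depend on two classes having different sizes rather than on class 7 being too small in itself.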

Some more detail about the error: all of my categorical features are binary and already one-hot encoded, so SMOTENC should not need to encode them again, and the number of columns should not increase. The error is raised at this line:
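One possible source of the larger column count, stated as an assumption: if SMOTENC one-hot encodes every column listed in cat_features internally, regardless of whether it is already binary, then each binary column would become two columns and each constant column would become one. The sketch below only counts the width such an encoding would produce on made-up data; one_hot_width is a hypothetical helper, not part of imbalanced-learn:

```python
import numpy as np

def one_hot_width(X_cat):
    """Count the columns a per-column one-hot encoding would produce."""
    return sum(len(np.unique(X_cat[:, j])) for j in range(X_cat.shape[1]))

rng = np.random.default_rng(0)

# Hypothetical data: 4 binary columns in which both 0 and 1 occur ...
binary = rng.integers(0, 2, size=(50, 4))
binary[0, :] = 0
binary[1, :] = 1  # guarantee both values appear in every column

# ... plus 2 constant columns that only ever contain 0.
constant = np.zeros((50, 2), dtype=int)
X_cat = np.hstack([binary, constant])

width = one_hot_width(X_cat)
print(X_cat.shape[1], "categorical columns ->", width, "encoded columns")
```

Running the same count on the real categorical columns of x would show whether 96 matches this explanation.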

..\lib\site-packages\imblearn\over_sampling\_smote\base.py", line 577, in _generate_samples
    ] = self._X_categorical_minority_encoded

The comment in this function (_generate_samples) says: "In the case that the median std was equal to zeros, we have to create non-null entry based on the encoded of OHE". The complete block in base.py where the error is raised is:

if math.isclose(self.median_std_, 0):
    nn_data[
        :, self.continuous_features_.size :
    ] = self._X_categorical_minority_encoded

However, I do not understand how the standard deviation can be zero: the samples in the dataset are not identical to each other, so the std should be nonzero. I do know that the column involved is one of the two sparse numerical columns (they contain many zeros). In the SMOTE source code I have seen a function that handles sparse columns, but it does not seem to work correctly here, which is how I end up with this error. I am not sure how to work around this problem and would appreciate any help or recommendations.
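One way the value could still be zero, which I have not confirmed against the library source: if median_std_ is the median of the per-column standard deviations of the numerical features, computed only on the samples of the smallest class, then two all-zero sparse columns out of three are enough to make the median zero even though the rows differ. A minimal sketch with made-up values for a 3-sample class:

```python
import numpy as np

# Hypothetical numerical features for a 3-sample class: one dense column
# and two sparse columns that happen to be all zero within this class.
X_num_minority = np.array([
    [1.2, 0.0, 0.0],
    [3.4, 0.0, 0.0],
    [2.1, 0.0, 0.0],
])

# Standard deviation of each numerical column within the class.
per_column_std = X_num_minority.std(axis=0)

# The median of three values is the middle one, so two zeros force it to zero.
median_std = float(np.median(per_column_std))

print(per_column_std, median_std)
```

Here the rows are all different, yet the median is zero because it is taken across columns, not across samples.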
