python - SMOTE is giving array size / ValueError for all-categorical dataset

Question

I am using SMOTE-NC to oversample my categorical data. I have only 1 feature and 10500 samples.

When I run the code below, I get this error:

   ---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-151-a261c423a6d8> in <module>()
     16 print(X_new.shape) # (10500, 1)
     17 print(X_new)
---> 18 sm.fit_sample(X_new, Y_new)

~\AppData\Local\Continuum\Miniconda3\envs\data-science\lib\site-packages\imblearn\base.py in fit_resample(self, X, y)
     81         )
     82 
---> 83         output = self._fit_resample(X, y)
     84 
     85         y_ = (label_binarize(output[1], np.unique(y))

~\AppData\Local\Continuum\Miniconda3\envs\data-science\lib\site-packages\imblearn\over_sampling\_smote.py in _fit_resample(self, X, y)
    926 
    927         X_continuous = X[:, self.continuous_features_]
--> 928         X_continuous = check_array(X_continuous, accept_sparse=["csr", "csc"])
    929         X_minority = _safe_indexing(
    930             X_continuous, np.flatnonzero(y == class_minority)

~\AppData\Local\Continuum\Miniconda3\envs\data-science\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    592                              " a minimum of %d is required%s."
    593                              % (n_features, array.shape, ensure_min_features,
--> 594                                 context))
    595 
    596     if warn_on_dtype and dtype_orig is not None and array.dtype != dtype_orig:

ValueError: Found array with 0 feature(s) (shape=(10500, 0)) while a minimum of 1 is required.

Code:

import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import SMOTENC

sm = SMOTENC(random_state=27,categorical_features=[0,])

X_new = np.array(X_train.values.tolist())
Y_new = np.array(y_train.values.tolist())

print(X_new.shape) # (10500,)
print(Y_new.shape) # (10500,)

X_new = np.reshape(X_new, (-1, 1)) # SMOTE requires a 2-D array, hence reshaping X_new

print(X_new.shape) # (10500, 1)
print(X_new)
sm.fit_sample(X_new, Y_new)

If I understand correctly, the shape of X_new should be (n_samples, n_features), i.e. 10500 x 1. I don't know why the ValueError reports it as shape=(10500, 0).

Can someone please help me here?

score 1 · Accepted Answer

I have reproduced your issue by adapting the example from the documentation so that the data contains a single categorical feature:

from collections import Counter
from numpy.random import RandomState
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTENC

X, y = make_classification(n_classes=2, class_sep=2,
 weights=[0.1, 0.9], n_informative=1, n_redundant=0, flip_y=0,
 n_features=1, n_clusters_per_class=1, n_samples=1000, random_state=10)

# simulate the only column to be a categorical feature
X[:, 0] = RandomState(10).randint(0, 4, size=(1000))
X.shape
# (1000, 1)

sm = SMOTENC(random_state=42, categorical_features=[0,]) # same behavior with categorical_features=[0]

X_res, y_res = sm.fit_resample(X, y)

This gives the same error:

ValueError: Found array with 0 feature(s) (shape=(1000, 0)) while a minimum of 1 is required.

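The reason is actually quite simple, but you have to dig a little into the original SMOTE paper; quoting from the relevant section (emphasis mine):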

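While our SMOTE approach currently does not handle data sets with all nominal features, it was generalized to handle mixed datasets of continuous and nominal features. We call this approach Synthetic Minority Over-sampling TEchnique-Nominal Continuous [SMOTE-NC]. We tested this approach on the Adult dataset from the UCI repository. The SMOTE-NC algorithm is described below.

1. Median computation: Compute the median of standard deviations of all continuous features for the minority class. If the nominal features differ between a sample and its potential nearest neighbors, then this median is included in the Euclidean distance computation. We use median to penalize the difference of nominal features by an amount that is related to the typical difference in continuous feature values.

2. Nearest neighbor computation: Compute the Euclidean distance between the feature vector for which k-nearest neighbors are being identified (minority class sample) and the other feature vectors (minority class samples) using the continuous feature space. For every differing nominal feature between the considered feature vector and its potential nearest neighbor, include the median of the standard deviations previously computed in the Euclidean distance computation.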

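In other words, although it is not stated explicitly, it is clear that the algorithm needs at least one continuous feature in order to work. That is not the case here, so the algorithm, quite unsurprisingly, fails.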
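To make the dependence on continuous features concrete, here is a minimal sketch of the distance described in the quoted passage. It is not imblearn's implementation; the function name `smote_nc_distance` and the sample values are made up for illustration. The penalty `med` is derived entirely from the continuous columns, so with no continuous columns there is nothing to compute it from.

```python
import numpy as np

def smote_nc_distance(x_cont, z_cont, x_nom, z_nom, med):
    """Euclidean distance on the continuous part, plus med**2 under the
    square root for every nominal feature that differs (hypothetical helper)."""
    x_cont = np.asarray(x_cont, dtype=float)
    z_cont = np.asarray(z_cont, dtype=float)
    cont_sq = np.sum((x_cont - z_cont) ** 2)
    n_diff = sum(a != b for a, b in zip(x_nom, z_nom))
    return float(np.sqrt(cont_sq + n_diff * med ** 2))

# Illustrative minority-class samples with three continuous columns.
minority_cont = np.array([[1.0, 2.0, 3.0],
                          [4.0, 6.0, 5.0],
                          [2.0, 3.0, 4.0]])

# Step 1: median of the per-column standard deviations (continuous columns only).
med = float(np.median(minority_cont.std(axis=0)))

# Step 2: distance between the first two samples; two of their three
# nominal values differ, so med**2 is added twice.
d = smote_nc_distance(minority_cont[0], minority_cont[1],
                      ["A", "B", "C"], ["A", "D", "E"], med)
print(med, d)
```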

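I guess that internally, during step 1 (median computation), the algorithm temporarily removes all categorical features from the data; doing so here leaves it with a shape of (1000, 0) (or (10500, 0) in your case), i.e. no data at all, hence the specific reference in the error message.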
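The traceback in the question is consistent with this guess: it fails right after the line `X_continuous = X[:, self.continuous_features_]`. The sketch below reproduces that slicing with plain NumPy. How `continuous_features` is derived here (every column index not listed as categorical) is an assumption for illustration, not a statement about imblearn's internals.

```python
import numpy as np

# One column of integer-coded categories, as in the reproduction above.
rng = np.random.RandomState(10)
X = rng.randint(0, 4, size=(1000, 1)).astype(float)
categorical_features = [0]

# Assumption: the continuous columns are all columns not marked categorical.
continuous_features = np.setdiff1d(np.arange(X.shape[1]), categorical_features)

# Selecting zero columns keeps every row but leaves no features.
X_continuous = X[:, continuous_features]
print(continuous_features.size)  # 0
print(X_continuous.shape)        # (1000, 0)
```

An array of this shape is exactly what `check_array` rejects with "Found array with 0 feature(s)".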

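So there is no actual programming issue to fix here; what you are trying to do is simply not possible with the SMOTE-NC algorithm (note that the NC in the algorithm's name stands for Nominal-Continuous).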
