r - SMOTE for balancing 200+ classes in R

Question

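I have a two-column dataset (a feature and a class) with more than 200 classes, and the input features must be classified into those classes. The number of occurrences per class ranges from 1 to a few thousand. The feature column contains text and numbers. I tried the following: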

SMOTE from UBL

SmoteClassif(lab ~ ., dat, C.perc = "balance",dist="HEOM")

This gives the warnings:

Warning messages:
1: SmoteClassif :: Nr of examples is less or equal to k.
 Using k = 1 in the nearest neighbours computation in this bump.
2: SmoteClassif :: Nr of examples is less or equal to k.
 Using k = 1 in the nearest neighbours computation in this bump.
3: SmoteClassif :: Nr of examples is less or equal to k.
 Using k = 2 in the nearest neighbours computation in this bump.
4: SmoteClassif :: Nr of examples is less or equal to k.
 Using k = 2 in the nearest neighbours computation in this bump.

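Even so, this balanced all the classes in lab nicely. However, not all features are present in the SMOTEd dataset. Isn't that data loss, i.e. missing features that are needed to train the model? I am new to this field. Do the warnings explain the problem? I have already tried k=1, but the end result is still the same.

Any suggestions would be helpful.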

score 3 · Accepted Answer

The SmoteClassif function implemented in the UBL package combines oversampling via the SMOTE procedure with random undersampling.

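This means that when you use the "balance" option, the function generates new synthetic cases for the rarest classes and removes cases from the most populated classes. The goal is a new, balanced dataset of roughly the same size as the original. Because cases are removed from the most frequent classes, some of your original rows no longer appear in the result.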
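To make the arithmetic concrete, here is a rough sketch in base R. It assumes that "balance" aims at about N / (number of classes) examples per class, which is what "balanced and roughly the same size" implies; UBL's exact rounding may differ. The counts are the ones from the cats example further down.

```r
# Class counts from the cats example in this answer (F = 47, M = 97).
counts <- c(F = 47, M = 97)

# Assumption: each class is steered towards total size / number of classes.
target <- round(sum(counts) / length(counts))

# A ratio above 1 means the class would be oversampled (synthetic cases added),
# a ratio below 1 means it would be undersampled (original cases removed).
ratio <- target / counts

print(target)
print(round(ratio, 2))
```

Under this assumption the M class loses roughly a quarter of its original rows, which is the kind of loss the question describes.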

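If you only want to apply oversampling, you need to specify in the C.perc parameter how much oversampling to apply to each class. For example, you can set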

C.perc = list(A = 2, B=3)

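This doubles the examples of class A and triples those of class B, while leaving the rest of the dataset unchanged (all other classes keep their frequency). In this case, your dataset grows by the new synthetic cases and no information is discarded!

A simple example: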

library(UBL)   # provides SmoteClassif
library(MASS)  # provides the cats dataset
data(cats)
table(cats$Sex)

 F  M 
47 97 

# class M is doubled (oversampled by a factor of 2), class F is left unchanged
mysmote.cats <- SmoteClassif(Sex~., cats, list(M = 2))
table(mysmote.cats$Sex)

  F   M 
 47 194 

# class M is oversampled to 150% and class F is undersampled to 50%
mysmote.cats <- SmoteClassif(Sex~., cats, list(M = 1.5, F = 0.5))
table(mysmote.cats$Sex)

  F   M 
 23 145 
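With 200+ classes, writing the C.perc list by hand is impractical, so it could be built from the class counts instead. The following is only a sketch: the `lab` vector is made-up data standing in for the question's label column, raising every class to the size of the largest one is just one possible target, and leaving out single-example classes (which have no same-class neighbour to interpolate with) is an assumption of this sketch, not something UBL requires.

```r
# Hypothetical labels standing in for the `lab` column of the question.
lab <- factor(c(rep("a", 1000), rep("b", 40), rep("c", 3), "d"))

# Named numeric vector of class counts.
counts <- table(lab)
n <- setNames(as.numeric(counts), names(counts))

# Oversample-only: raise each smaller class towards the largest class size.
# Classes with a single example are skipped here (assumption, see above).
target <- max(n)
eligible <- n[n >= 2 & n < target]

# One multiplier per class; classes not listed keep their original frequency.
c_perc <- as.list(target / eligible)
str(c_perc)

# With UBL installed and `dat` being the question's data frame, the list
# could then be passed on like this (not run here):
# library(UBL)
# oversampled <- SmoteClassif(lab ~ ., dat, C.perc = c_perc, dist = "HEOM")
```

Because every multiplier is above 1 and unlisted classes keep their frequency, no original rows are dropped in this setup.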

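Regarding the warnings: by default the function uses k=5 when computing the nearest neighbours of an example from a given class. In some datasets, however, that number of neighbours cannot be computed because there are not enough examples. For example, if you only have 3 examples of class A, then for any case selected from that class you can find at most 2 nearest neighbours within the class!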

So a warning is shown whenever the selected k is too large for that many neighbours of a case to be found within its class.
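To see in advance which classes would trigger the warning, the class counts can be compared with k. This base R sketch uses made-up labels and assumes the fallback is k = (class size - 1), which matches the warnings in the question (k = 1 and k = 2) and the 3-example case described above; it does not call UBL itself.

```r
# Hypothetical labels: one large class and two very small ones.
lab <- factor(c(rep("a", 50), rep("b", 3), rep("c", 2)))
n <- setNames(as.numeric(table(lab)), levels(lab))

k <- 5  # the default number of neighbours mentioned above

# Classes with k or fewer examples cannot supply k same-class neighbours.
too_small <- n[n <= k]

# Assumed fallback: use as many neighbours as the class can offer.
k_used <- pmin(k, n - 1)

print(too_small)
print(k_used)
```

A class with a single example would end up with 0 usable neighbours under this rule, which may be why lowering k to 1 did not change the outcome.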
