r - R: variable has a different number of levels in the node and in the data

Question

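I want to use the bnlearn naive Bayes algorithm for a classification task.

I am using this dataset for testing. Three of its variables are continuous (V2, V4, V10) and the others are discrete. As far as I know, bnlearn cannot use continuous variables here, so they need to be converted to factors or discretized. For now I want to convert all features to factors, but I have run into a problem. Here is some sample code: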

library(bnlearn)

dataSet <- read.csv("creditcard_german.csv", header=FALSE)
# ... split into trainSet and testSet ...

trainSet[] <- lapply(trainSet, as.factor)
testSet[] <- lapply(testSet, as.factor)

# V25 is the class variable
bn = naive.bayes(trainSet, training = "V25")
fitted = bn.fit(bn, trainSet, method = "bayes")
pred = predict(fitted , testSet)

...

With this code, I get the following error message when calling predict():

'V1' has different number of levels in the node and in the data.

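When I remove V1 from the training set, I get the same error for V2. However, the error goes away if I do the factor conversion on the whole dataset first, dataSet[] <- lapply(dataSet, as.factor), and only then split it into training and test sets.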

So what is the elegant solution? In real-world applications the training and test sets can come from different sources. Any ideas?

score 0 · Accepted Answer

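The problem appears to be caused by my training and test datasets having different factor levels. I solved it by combining the two data frames (training and test) with rbind, applying as.factor to get the full set of factor levels over the complete dataset, and then slicing the factorized data frame back into separate training and test sets.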

library(bnlearn)

train <- read.csv("train.csv", header=FALSE)
test <- read.csv("test.csv", header=FALSE)
len_train = dim(train)[1]
len_test = dim(test)[1]

# stack both sets so as.factor sees every level of every variable
complete <- rbind(train, test)
complete[] <- lapply(complete, as.factor)

# slice the factorized data back into training and test rows
train = complete[1:len_train, ]
l = len_train+1
lf = len_train + len_test
test = complete[l:lf, ]

# V25 is the class variable
bn = naive.bayes(train, training = "V25")
fitted = bn.fit(bn, train, method = "bayes")
pred = predict(fitted , test)
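
If the test data only arrives after the model has been fitted, combining it with the training data may not be possible. A possible alternative, sketched below in base R, is to re-code the test columns with the factor levels of the training columns. The helper name align_levels and the toy data frames are assumptions made for illustration; they are not part of bnlearn or of the original dataset.

```r
# Hypothetical helper (not part of bnlearn): re-code each shared column of
# `newdata` using the factor levels of the same column in `reference`.
align_levels <- function(newdata, reference) {
  for (col in intersect(names(reference), names(newdata))) {
    newdata[[col]] <- factor(as.character(newdata[[col]]),
                             levels = levels(reference[[col]]))
  }
  newdata
}

# Toy data standing in for the real training and test files.
train <- data.frame(V1 = c("a", "b", "c", "a"), V25 = c("1", "2", "1", "2"))
test  <- data.frame(V1 = c("a", "d"),           V25 = c("1", "2"))

train[] <- lapply(train, as.factor)
test <- align_levels(test, train)

levels(test$V1)   # now the same levels as train$V1
is.na(test$V1)    # the value "d", never seen in training, has become NA
```

With the levels aligned this way, the level-count mismatch between the fitted nodes and the test data should no longer occur. Values that never appeared in the training set become NA, so those rows may need to be dropped or handled separately before calling predict(). Continuous columns such as V2, V4 and V10 are especially prone to this when converted with as.factor, so binning them first is probably the safer choice.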

I hope this helps.
