r - C5.0 decision tree - c50 code called exit with value 1

Question

I am getting the following error

c50 code called exit with value 1

I am doing this on the Titanic data available from Kaggle

# Importing datasets
train <- read.csv("train.csv", sep=",")

# this is the structure
str(train)

Output:

'data.frame':   891 obs. of  12 variables:
 $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
 $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
 $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
 $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...

Then I tried using a C5.0 decision tree

# Trying with C5.0 decision tree
library(C50)

#C5.0 models require a factor outcome otherwise error
train$Survived <- factor(train$Survived)

new_model <- C5.0(train[-2],train$Survived)

Running the line above gave me this error

c50 code called exit with value 1

I can't figure out what is going wrong. I used similar code on a different dataset and it worked fine. Any ideas on how to debug my code?

-Thanks

score 15 · Accepted Answer

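For anyone interested, the data can be found here: http://www.kaggle.com/c/titanic-gettingStarted/data. I think you need to be signed up to download it.

Regarding your problem, first of all I think you meant to write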

new_model <- C5.0(train[,-2],train$Survived)

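Next, note the structure of the Cabin and Embarked columns. Both of these factors have an empty string as a level name (check levels(train$Embarked)). This is where C50 falls over. If you modify your data so that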

levels(train$Cabin)[1] = "missing"
levels(train$Embarked)[1] = "missing"

your algorithm will now run without an error.
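On a wider data frame it can help to locate every affected column before renaming levels by hand. The following is a minimal sketch in base R; the helper name `find_blank_level_columns` and the toy data are made up for illustration and are not part of C50 or the Kaggle data.

```r
# Hypothetical helper: names of all factor columns that have "" among their levels.
find_blank_level_columns <- function(df) {
  has_blank <- vapply(
    df,
    function(col) is.factor(col) && "" %in% levels(col),
    logical(1)
  )
  names(df)[has_blank]
}

# Small stand-in for the Titanic data (values invented for illustration).
toy <- data.frame(
  Sex = factor(c("male", "female", "female")),
  Cabin = factor(c("", "C85", "")),
  Embarked = factor(c("S", "", "C")),
  Age = c(22, 38, 26)
)

blank_cols <- find_blank_level_columns(toy)
print(blank_cols)

# Rename the empty level in each flagged column, as in the answer above.
for (col in blank_cols) {
  levels(toy[[col]])[levels(toy[[col]]) == ""] <- "missing"
}
```

Matching on `levels(...) == ""` rather than on position 1 avoids relying on the empty string being sorted first.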

score 8 · Accepted Answer

Just in case: you can see the error by running

summary(new_model)

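This error also occurs when a variable name contains special characters. For example, a "я" character (from the Russian alphabet) in a variable name triggers it.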
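One way to check for this before fitting is to scan the column names for non-ASCII characters. This is only a sketch: the helper `non_ascii_names` is hypothetical, and treating "non-ASCII" as the definition of a problematic character is an assumption based on the example above, not a documented C50 rule.

```r
# Hypothetical helper: column names that contain any non-ASCII character.
non_ascii_names <- function(df) {
  nms <- enc2utf8(names(df))
  flagged <- vapply(
    nms,
    function(nm) any(utf8ToInt(nm) > 127L),  # code points above 127 are non-ASCII
    logical(1)
  )
  names(df)[flagged]
}

toy <- data.frame(a = 1:2, b = 3:4)
names(toy) <- c("age", "\u044fvalue")  # second name starts with Cyrillic "я"

bad <- non_ascii_names(toy)
print(bad)
```

Any names reported this way can then be renamed to plain ASCII before calling C5.0.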

score 6 · Accepted Answer

This is what finally worked:

I got the idea after reading this post

library(C50)

test$Survived <- NA

combinedData <- rbind(train,test)

combinedData$Survived <- factor(combinedData$Survived)

# fixing empty character level names 
levels(combinedData$Cabin)[1] = "missing"
levels(combinedData$Embarked)[1] = "missing"

new_train <- combinedData[1:891,]
new_test <- combinedData[892:1309,]

new_model <- C5.0(new_train[,-2],new_train$Survived)

new_model_predict <- predict(new_model,new_test)

submitC50 <- data.frame(PassengerId=new_test$PassengerId, Survived=new_model_predict)
write.csv(submitC50, file="c50dtree.csv", row.names=FALSE)

The intuition behind this is that the training and test datasets will then have consistent factor levels.
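The effect on factor levels can be seen on a toy example. This sketch uses invented data and only base R; it illustrates the combine-then-split idea and does not call C50.

```r
# Two toy sets whose Embarked factors start out with different levels.
train_toy <- data.frame(Embarked = factor(c("S", "C")))
test_toy <- data.frame(Embarked = factor(c("Q", "S")))

# rbind() on data frames merges the factor levels of both inputs.
combined <- rbind(train_toy, test_toy)

# Splitting by row keeps the merged levels on both halves.
new_train <- combined[1:2, , drop = FALSE]
new_test <- combined[3:4, , drop = FALSE]

same_levels <- identical(levels(new_train$Embarked), levels(new_test$Embarked))
print(same_levels)
```

After the split, `new_train$Embarked` carries the level "Q" even though no training row uses it, so a model trained on it has seen the same set of levels that appears in the test rows.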

score 3 · Accepted Answer

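I had the same error, but I was using a numeric dataset with no missing values.

After a long time I found that my dataset had a predictor attribute named "outcome", and C5.0Control uses this name, which was the cause of the error :'(

My solution was to rename the column. Alternatively, you can create a C5.0Control object, change the value of its label argument, and pass that object as an argument to the C5.0 function.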
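Both options can be sketched as follows. The data are random toy values, and the names `outcome_value` and `"target"` are arbitrary choices for illustration. The C50 call is guarded so the snippet still runs when the package is not installed; it assumes the `label` argument of `C5.0Control` and the `control` argument of `C5.0`.

```r
# Toy numeric data with a predictor that happens to be called "outcome".
set.seed(1)
x <- data.frame(outcome = rnorm(40), other = rnorm(40))
y <- factor(ifelse(x$outcome + x$other > 0, "yes", "no"))

# Option 1: rename the clashing predictor column.
x_renamed <- x
names(x_renamed)[names(x_renamed) == "outcome"] <- "outcome_value"

# Option 2: keep the column name and change the label C5.0 uses for the class.
if (requireNamespace("C50", quietly = TRUE)) {
  ctrl <- C50::C5.0Control(label = "target")
  model <- C50::C5.0(x, y, control = ctrl)
}
```

Renaming the column is the simpler change; the control object is useful when the column name has to stay as it is.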

score 0 · Accepted Answer

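I also struggled for several hours with the same problem (return code "1"), both when building a model and when predicting. Prompted by Marco's answer, I wrote a small function that replaces every factor level equal to "" in a data frame or vector; see the code below. However, because R does not pass arguments by reference, you have to use the function's return value (it cannot modify the original data frame):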

# Rename an empty-string factor level to "?" in every factor column of a data frame.
removeBlankLevelsInDataFrame <- function(dataframe) {
  for (i in seq_len(ncol(dataframe))) {
    lv <- levels(dataframe[[i]])
    if (!is.null(lv) && "" %in% lv) {
      levels(dataframe[[i]])[lv %in% ""] <- "?"
    }
  }
  dataframe
}

# Same fix for a single factor vector (for example the outcome).
removeBlankLevelsInVector <- function(vector) {
  lv <- levels(vector)
  if (!is.null(lv) && "" %in% lv) {
    levels(vector)[lv %in% ""] <- "?"
  }
  vector
}

A call to the functions might look like this:

trainX = removeBlankLevelsInDataFrame(trainX)
trainY = removeBlankLevelsInVector(trainY)
model = C50::C5.0.default(trainX,trainY)

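However, C50 appears to have a similar problem with character columns that contain an empty cell, so if you have character attributes you may need to extend this to handle them as well.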
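A possible extension for character columns is sketched below. The helper `replace_blank_strings` and the toy data are hypothetical, and using "?" as the replacement simply mirrors the functions above.

```r
# Hypothetical helper: replace "" cells in every character column of a data frame.
replace_blank_strings <- function(df, replacement = "?") {
  for (i in seq_along(df)) {
    col <- df[[i]]
    if (is.character(col)) {
      # Leave NA cells alone; only exact empty strings are replaced.
      col[!is.na(col) & col == ""] <- replacement
      df[[i]] <- col
    }
  }
  df
}

toy <- data.frame(
  Ticket = c("A/5 21171", "", "113803"),
  Fare = c(7.25, 71.28, 53.1),
  stringsAsFactors = FALSE
)

cleaned <- replace_blank_strings(toy)
print(cleaned)
```

As with the factor versions, the result has to be assigned back, for example `trainX <- replace_blank_strings(trainX)`.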

score 0 · Accepted Answer

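I ran into the same error too, but in my case it was because the factor levels of one column contained some illegal characters.

I used the make.names function to correct the factor levels: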

levels(FooData$BarColumn) <- make.names(levels(FooData$BarColumn))

That solved the problem.
