r - Reading a table and random forest in R

Question

I am trying to use the random forest method in R. I need to read a txt file (the training set).

dataset<- read.table(path1,header=TRUE,sep=",")

The column names begin with a digit (e.g. 1005_at), so R automatically converts them by prepending an X (e.g. X1005_at). To work around this, I did:

colnames(dataset)<-gsub("^[X](.*)","\\1",colnames(dataset))

Now the names are fine, but when I run random forest:

model.rf <- randomForest(class ~ ., data=dataset, importance=TRUE,keep.forest=T, ntree=5, do.trace=T)

I get this error:

Error in eval(expr, envir, enclos) : object '1005_at' not found

If I run random forest on the original dataset (without modifying the names, i.e. using X1005_at), the error does not occur. Why? How can I fix it?

score 0 · Accepted Answer

Use read.csv, which already has appropriate defaults for header and sep, and pass check.names=FALSE to stop R from mangling the names.
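
As a minimal sketch of the difference, the following reads the same header twice, once with the default and once with check.names=FALSE. The inline CSV text and the probe name 1007_s_at are made up for illustration; only 1005_at comes from the question.

```r
# Hypothetical two-probe training set, inlined so the example is self-contained.
csv_text <- "1005_at,1007_s_at,class\n1.2,3.4,a\n2.3,4.5,b"

# Default behaviour: check.names = TRUE passes the header through make.names(),
# which prepends "X" to names that start with a digit.
mangled <- read.csv(text = csv_text)
names(mangled)

# With check.names = FALSE the header is kept verbatim.
verbatim <- read.csv(text = csv_text, check.names = FALSE)
names(verbatim)
```

Reading the names verbatim removes the need for the gsub step in the question.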

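The formula method of randomForest will not accept non-syntactic names in the input data frame. Use the default method instead, passing the predictors as x and the response as y.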
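
To see which names count as non-syntactic, one rough check is to compare each name with what make.names() would turn it into. The helper is_syntactic below is hypothetical (it is not part of base R or randomForest), and the small data frame is invented for illustration.

```r
# Hypothetical helper: treat a name as syntactic if make.names() leaves it unchanged.
is_syntactic <- function(x) x == make.names(x)

nm <- c("1005_at", "X1005_at", "class")
is_syntactic(nm)

# Non-syntactic columns are still reachable with backticks or [[ ]].
df <- data.frame(`1005_at` = c(1.2, 2.3), class = c("a", "b"), check.names = FALSE)
df$`1005_at`
df[["1005_at"]]
```

Only "1005_at" fails the check here, which matches the name reported in the error message.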

Putting both points together, we have:

> library(randomForest)
> # dataset <- read.csv(path1, check.names = FALSE)
> 
> # next few lines are to make example similar to the one in the question
> dataset <- CO2
> names(dataset) <- c(paste(1:4, names(dataset[1:4]), sep = "_"), "class")
> names(dataset)
[1] "1_Plant"     "2_Type"      "3_Treatment" "4_conc"      "class"      
> 
> i <- match("class", names(dataset)) # i is index of class column
> fm <- randomForest(dataset[-i], dataset[[i]]
+    # other arguments - in this example none
+ )
> fm

Call:
 randomForest(x = dataset[-i], y = dataset[[i]]) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 1

          Mean of squared residuals: 26.43385
                    % Var explained: 77.13
> fm$importance
            IncNodePurity
1_Plant          2105.779
2_Type           1529.527
3_Treatment       557.300
4_conc           2265.724
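
A possible alternative, not part of the answer above, builds on the question's own observation that the formula interface works with the X-prefixed names: keep the syntactic names for fitting and translate them back only when reporting. The sketch below uses base R only; the header vector and the imp_rows stand-in for importance row names are assumptions for illustration.

```r
# Hypothetical original header as it appears in the file.
orig <- c("1005_at", "1007_s_at", "class")

# Syntactic versions, as read.csv would produce with its default check.names = TRUE.
safe <- make.names(orig, unique = TRUE)

# Lookup table from syntactic name back to original name.
lookup <- setNames(orig, safe)

# Stand-in for the row names of an importance table from a model fitted on the safe names.
imp_rows <- c("X1007_s_at", "X1005_at")
unname(lookup[imp_rows])
```

This keeps class ~ . usable while still showing the original probe names in the final output.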
