
I am trying to build a classifier based on random forests in R.

Code to reproduce the problem:

    library(quantmod)
    library(randomForest)

    getSymbols('^GSPC', from="2002-01-01") 
    GSPC <- GSPC[,1:5] # remove adjusted close
    GSPC$wkret <- lag(GSPC$GSPC.Close,-5)/GSPC$GSPC.Close # build weekly future return
    GSPC$wkret <- GSPC$wkret * 100 -100 # build index

    cutoff <- floor(dim(GSPC)[1]/4) # select the row at 25%
    cutoffbreak <- sort(abs(as.data.frame(GSPC$wkret)[,1]),decreasing=T)[cutoff] # get the top 25% return in absolute terms
    y <- cut(GSPC$wkret, breaks=c('-100',-cutoffbreak,cutoffbreak ,'100'),labels=c('down','','up')) # build factors 
    randomForest(GSPC[1:100],y[1:100]) # select first 100 to exclude NA's, dimension problems.

This works:

    > y[1:100]
    [1]                                                                                      down      down down
     [22]                up   up        down      down                          up   up   up   up 
    === snip ===

    > is.factor(y)
    [1] TRUE

    > x[1:100]
                  open    high     low   close     volume
    2002-01-02 1148.08 1154.67 1136.23 1154.67 1171000000
    2002-01-03 1154.67 1165.27 1154.01 1165.27 1398900000
    2002-01-04 1165.27 1176.55 1163.42 1172.51 1513000000
    2002-01-07 1172.51 1176.97 1163.55 1164.89 1308300000
    === snip ===

    > class(x)
    [1] "xts" "zoo"

This also runs (although it makes no sense, of course):

    lm(y[1:100] ~ .,data=x[1:100])

But building a random forest gives:

    > rf <- randomForest(y[1:100] ~ .,data=x[1:100])
    Error in randomForest.default(m, y, ...) : subscript out of bounds

    > traceback()
    4: randomForest.default(m, y, ...)
    3: randomForest(m, y, ...)
    2: randomForest.formula(y[1:100] ~ ., data = x[1:100])
    1: randomForest(y[1:100] ~ ., data = x[1:100])

Googling suggests this is a dimension problem, but I can't figure out why or how.

R version:

    > R.version
                   _
    platform       i686-pc-linux-gnu
    arch           i686
    os             linux-gnu
    system         i686, linux-gnu
    status
    major          2
    minor          15.1
    year           2012
    month          06
    day            22
    svn rev        59600
    language       R
    version.string R version 2.15.1 (2012-06-22)
    nickname       Roasted Marshmallows

Library versions:

    randomForest version: "2.15.1"
    quantmod version: "2.15.1"

2 Answers


Something went wrong when I created y. After adding this line, the code runs fine:

    y <- as.factor(as.numeric(y))

I don't know what is wrong with my y values, but I realize the problem can only be reproduced with the full code:

    > randomForest(na.omit(GSPC),y[1:2713])
    Error in randomForest.default(na.omit(GSPC), y[1:2713]) : 
      subscript out of bounds
    > y <- as.factor(as.numeric(y))
    > randomForest(na.omit(GSPC),y[1:2713])

    Call:
     randomForest(x = na.omit(GSPC), y = y[1:2713]) 
                   Type of random forest: classification
                         Number of trees: 500
    No. of variables tried at each split: 2

            OOB estimate of  error rate: 0.07%
    Confusion matrix:
        1    2   3 class.error
    1 348    1   0 0.002865330
    2   0 2034   0 0.000000000
    3   0    1 329 0.003030303
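
One plausible explanation, which is an assumption and is not confirmed anywhere in this thread, is that `labels=c('down','','up')` gives the middle class an empty-string level name, and `as.factor(as.numeric(y))` helps because it replaces the level names with `1`, `2`, `3`. The sketch below uses simulated returns (`wkret` here is a made-up stand-in for `GSPC$wkret`) and base R only, so it shows the factor levels but does not call `randomForest`. The label `"flat"` is a hypothetical name chosen for illustration.

```r
# Hypothetical stand-in for GSPC$wkret: simulated weekly returns in percent.
set.seed(42)
wkret <- rnorm(200, mean = 0, sd = 2)

# Threshold at the top 25% of absolute returns, as in the question.
cutoffbreak <- sort(abs(wkret), decreasing = TRUE)[floor(length(wkret) / 4)]
breaks <- c(-100, -cutoffbreak, cutoffbreak, 100)

# The question's labelling: the middle class gets an empty-string level name.
y_empty <- cut(wkret, breaks = breaks, labels = c("down", "", "up"))
print(levels(y_empty))

# The workaround from this answer: level names become the integer codes.
y_codes <- as.factor(as.numeric(y_empty))
print(levels(y_codes))

# Alternative (assumption): give the middle class an explicit, made-up name.
y_named <- cut(wkret, breaks = breaks, labels = c("down", "flat", "up"))
print(table(y_named))
```

Naming the middle class keeps the confusion matrix readable (`down`, `flat`, `up` instead of `1`, `2`, `3`), whereas the numeric conversion discards the labels.
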
answered 2012-10-16T17:33:25.620

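Calling randomForest with formula and data arguments is very common, but if x is a plain matrix, x[1:100] is a vector, not a matrix. I think you meant x[1:100,].

Also, the data argument should be a data frame, not a matrix. I assume x is a matrix (rather than a data frame), because otherwise x[1:100] would return the following error message:

    Error in `[.data.frame`(x, 100) : undefined columns selected

Alternatively, as suggested in the comments, you can also run

    randomForest( x[ 1:100, ], y[ 1:100 ] )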
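
The indexing difference described in this answer can be checked with base R alone. The sketch below uses a tiny made-up matrix `m` and data frame `df` (hypothetical stand-ins, not the question's data) to show what a single index does in each case.

```r
# Hypothetical stand-in for the price data: 5 rows, 2 columns.
m  <- matrix(1:10, nrow = 5, dimnames = list(NULL, c("open", "close")))
df <- as.data.frame(m)

# A single index on a plain matrix walks the underlying vector, so dims are lost.
v <- m[1:3]
print(is.null(dim(v)))

# Adding the comma selects rows and keeps the matrix shape.
print(dim(m[1:3, ]))

# A single index on a data frame selects columns; asking for 3 of 2 columns fails.
res <- tryCatch(df[1:3], error = function(e) conditionMessage(e))
print(res)

# Row selection on the data frame, suitable for a `data =` argument.
print(dim(df[1:3, ]))
```

Note that the question's x is an xts object, and its own printout of `x[1:100]` shows rows being returned, so the vector behaviour above applies to plain matrices rather than necessarily to x itself.
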
answered 2012-10-15T16:58:56.583