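I have built a logistic regression model using the glm function from the stats package. I now want to predict the model's outcome for a large number of values stored in an "ffdf" object (see the ff package), but I don't know how to proceed:

  1. How can I create a subset of my ffdf object that keeps only the variables (i.e., columns) to be used in my prediction? This is needed as the input to the prediction function.

  2. How should I proceed next? Which function should I use among predict(), predict.glm(), and predict.bigglm() (perhaps the biglm package could help)?

Thanks in advance for your views on this!

Best regards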

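Update

Thanks for your feedback, BondedDust.
To be more precise, this is indeed a coding question: the aim is to fit a logistic regression on one ffdf object (the training dataset) and to predict the model's outcome on another ffdf object (the test dataset).

(1/3) Training dataset: an ffdf object (created with the ff package).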

` class(train.random.sample)` >   
[1] "ffdf"

Here is the structure of the ffdf object, in case it is useful:

`str(train.random.sample) ` >

List of 3   
 $ virtual: 'data.frame':   27 obs. of  7 variables:   
 .. $ VirtualVmode     : chr  "integer" "integer" "integer" "integer" ...   
 .. $ AsIs             : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...   
 .. $ VirtualIsMatrix  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...   
 .. $ PhysicalIsMatrix : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...   
 .. $ PhysicalElementNo: int  1 2 3 4 5 6 7 8 9 10 ...   
 .. $ PhysicalFirstCol : int  1 1 1 1 1 1 1 1 1 1 ...   
 .. $ PhysicalLastCol  : int  1 1 1 1 1 1 1 1 1 1 ...   
 .. - attr(*, "Dim")= int  500000 27   
 .. - attr(*, "Dimorder")= int  1 2   
 $ physical: List of 27   
 .. $ id                : list()   
…   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ click             : list()   
…   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ hour              : list()   
…   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ C1                : list()   
…   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ banner_pos        : list()   
 ..  ..- attr(*, "physical")=Class 'ff_pointer' <externalptr>    
 ..  .. ..- attr(*, "vmode")= chr "integer"   
 ..  .. ..- attr(*, "maxlength")= int 500000   
 ..  .. ..- attr(*, "pattern")= chr "ffdf"   
 ..  .. ..- attr(*, "filename")= chr "anonymized.ff"   
 ..  .. ..- attr(*, "pagesize")= int 65536   
 ..  .. ..- attr(*, "finalizer")= chr "delete"   
 ..  .. ..- attr(*, "finonexit")= logi TRUE   
 ..  .. ..- attr(*, "readonly")= logi FALSE   
 ..  .. ..- attr(*, "caching")= chr "mmnoflush"   
 ..  ..- attr(*, "virtual")= list()   
 ..  .. ..- attr(*, "Length")= int 500000   
 ..  .. ..- attr(*, "Symmetric")= logi FALSE    
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ site_id           : list()   
…   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ site_domain       : list()   
…   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ site_category     : list()   
…   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ app_id            : list()   
…   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ app_domain        : list()   
…  
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ app_category      : list()   
…   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ device_id         : list()   
 …   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ device_ip         : list()   
….   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ device_os         : list()   
…   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ device_make       : list()   
…   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ device_model      : list()   
…   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ device_type       : list()   
…   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ device_conn_type  : list()   
…   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ device_geo_country: list()   
…   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ C17               : list()   
…   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
$ row.names:  NULL   
- attributes: List of 2   
 .. $ names: chr [1:3] "virtual" "physical" "row.names"   
 .. $ class: chr "ffdf"   

(2/3) Logistic regression based on the training dataset:

The goal is to learn/predict the "click" outcome from the "banner_pos" input.

`logreg1 <- glm(click ~ banner_pos, data = train.random.sample, family = "binomial")   
summary(logreg1)` >   


Call:
glm(formula = click ~ banner_pos, family = "binomial", data = train.random.sample)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.0555  -0.6495  -0.5951  -0.5951   1.9071  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.641416   0.004702 -349.12   <2e-16 ***
banner_pos   0.192534   0.007595   25.35   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 458848  on 499999  degrees of freedom
Residual deviance: 458215  on 499998  degrees of freedom
AIC: 458219

Number of Fisher Scoring iterations: 4

`class(logreg1)`>
[1] "glm" "lm" 

(3/3) Test dataset: an ffdf object (created with the ff package).

`class(df.test)` >   
[1] "ffdf"

The test dataset has the same structure as the training dataset, with about 4.8 million rows.

`str(df.test)`>   

List of 3   
 $ virtual: 'data.frame':   26 obs. of  7 variables:   
 .. $ VirtualVmode     : chr  "integer" "integer" "integer" "integer" ...   
 .. $ AsIs             : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...   
 .. $ VirtualIsMatrix  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...   
 .. $ PhysicalIsMatrix : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...   
 .. $ PhysicalElementNo: int  1 2 3 4 5 6 7 8 9 10 ...   
 .. $ PhysicalFirstCol : int  1 1 1 1 1 1 1 1 1 1 ...   
 .. $ PhysicalLastCol  : int  1 1 1 1 1 1 1 1 1 1 ...   
 .. - attr(*, "Dim")= int  4769401 26   
 .. - attr(*, "Dimorder")= int  1 2   
 $ physical: List of 26   
…   

I have not managed to predict the click outcome. I first tried to create a data frame or ffdf object containing the banner_pos variable:

`modeldata <- df.test[["banner_pos"]]`

Then I tried to predict the outcome:

`predict.glm(object = logreg1, newdata = modeldata, type = "response")`

Error in as.data.frame.default(data) : 
  cannot coerce class "c("ff_vector", "ff")" to a data.frame

Is there something wrong with my code? Should I use other functions from other packages (e.g., biglm)?
Many thanks for your views on this issue,
Best regards

1 Answer

Something like this will score your glm alongside your ffdf:

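library(ff)
# Pre-allocate an on-disk column to hold the predicted probabilities
df.test$score <- ff(as.numeric(NA), length = nrow(df.test))
# chunk() splits the rows into ranges that can be processed in RAM one at a time
chunks <- chunk(df.test)
for(chunkrangeindex in chunks){
  # df.test[chunkrangeindex, ] loads the chunk as a regular data.frame
  df.test$score[chunkrangeindex] <- predict(object = logreg1, newdata = df.test[chunkrangeindex, ], type = "response")
}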
answered 2014-11-12T21:15:13.997
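
The loop above works because each chunk reaches predict() as an ordinary data.frame with a column named banner_pos, whereas the question passed a bare ff vector. The same pattern can be tried without ff at all. The following is a minimal base-R sketch under stated assumptions: the data are simulated, a plain data.frame stands in for the ffdf, and the names train, test, chunk_size and score are hypothetical (in the real code, ff's chunk() picks the row ranges).

```r
# Base-R sketch of chunked scoring; a plain data.frame stands in for the ffdf.
set.seed(1)

# Simulated training data (assumption: binary banner_pos, binary click)
train <- data.frame(banner_pos = sample(0:1, 1000, replace = TRUE))
train$click <- rbinom(1000, 1, plogis(-1.6 + 0.2 * train$banner_pos))
fit <- glm(click ~ banner_pos, data = train, family = "binomial")

# Simulated test data with the predictor only
test <- data.frame(banner_pos = sample(0:1, 250, replace = TRUE))

# Hypothetical fixed chunk size, used here only to form row ranges
chunk_size <- 100
starts <- seq(1, nrow(test), by = chunk_size)

# Pre-allocate the result, then fill it one chunk at a time
score <- rep(NA_real_, nrow(test))
for (s in starts) {
  rows <- s:min(s + chunk_size - 1, nrow(test))
  # newdata must be a data.frame whose column name matches the model term;
  # drop = FALSE keeps a one-column selection from collapsing to a vector
  chunk_df <- test[rows, , drop = FALSE]
  score[rows] <- predict(fit, newdata = chunk_df, type = "response")
}

# Scoring everything at once gives the same probabilities
full <- predict(fit, newdata = test, type = "response")
stopifnot(isTRUE(all.equal(score, unname(full))))
```

Because each row's prediction depends only on that row, splitting the test set into chunks changes the memory footprint but not the result, which is why a model fitted with plain glm() can still score a dataset too large to hold in RAM as a single data.frame.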