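I have built a logistic regression model with the `glm` function from the stats package. I now want to predict that model's outcome for a large number of values stored in an "ffdf" object (see the ff package), but I don't know how to proceed:
How do I subset my ffdf object so that it keeps only the variables (i.e. columns) used in my prediction? This subset has to be passed as input to the predict function.
What should I do next? Which function should I use among predict(), predict.glm() and predict.bigglm() (perhaps the biglm package helps)?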
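For reference on the first question, here is a minimal sketch of the three common ways to index an ffdf and what each one returns. The object `toy` and its columns are hypothetical stand-ins for the real data, and the sketch assumes the ff package is installed.

```r
library(ff)

# Hypothetical toy data standing in for the real ffdf (integer columns only)
toy <- as.ffdf(data.frame(
  id         = 1:6,
  banner_pos = c(0L, 1L, 0L, 1L, 1L, 0L),
  hour       = 1:6
))

# Single-bracket indexing with a column name keeps the ffdf class
vars <- toy["banner_pos"]
class(vars)

# Double-bracket indexing returns the underlying ff vector, not a data frame
vec <- toy[["banner_pos"]]
class(vec)

# Row-and-column indexing reads the selected rows into RAM as a data.frame;
# drop = FALSE stops a single column from collapsing to a plain vector
head_df <- toy[1:3, "banner_pos", drop = FALSE]
class(head_df)
```

Since predict() expects `newdata` to be a data frame, the last form is the one that can be handed to it directly, provided the selected rows fit in memory.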
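Thanks in advance for your thoughts on this!
Best regards
UPDATE
Thanks for your feedback, BondedDust.
Let me be more precise: this is indeed a coding question. The goal is to fit a logistic regression on one ffdf object (the training dataset) and to predict the model's outcome for another ffdf object (the test dataset).
(1/3) Training dataset: an ffdf object (created with the ff package).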
` class(train.random.sample)` >
[1] "ffdf"
Here is the structure of the ffdf object, in case it is useful:
`str(train.random.sample) ` >
List of 3
$ virtual: 'data.frame': 27 obs. of 7 variables:
.. $ VirtualVmode : chr "integer" "integer" "integer" "integer" ...
.. $ AsIs : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
.. $ VirtualIsMatrix : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
.. $ PhysicalIsMatrix : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
.. $ PhysicalElementNo: int 1 2 3 4 5 6 7 8 9 10 ...
.. $ PhysicalFirstCol : int 1 1 1 1 1 1 1 1 1 1 ...
.. $ PhysicalLastCol : int 1 1 1 1 1 1 1 1 1 1 ...
.. - attr(*, "Dim")= int 500000 27
.. - attr(*, "Dimorder")= int 1 2
$ physical: List of 27
.. $ id : list()
…
.. .. - attr(*, "class") = chr [1:2] "ff_vector" "ff"
.. $ click : list()
…
.. .. - attr(*, "class") = chr [1:2] "ff_vector" "ff"
.. $ hour : list()
…
.. .. - attr(*, "class") = chr [1:2] "ff_vector" "ff"
.. $ C1 : list()
…
.. .. - attr(*, "class") = chr [1:2] "ff_vector" "ff"
.. $ banner_pos : list()
.. ..- attr(*, "physical")=Class 'ff_pointer' <externalptr>
.. .. ..- attr(*, "vmode")= chr "integer"
.. .. ..- attr(*, "maxlength")= int 500000
.. .. ..- attr(*, "pattern")= chr "ffdf"
.. .. ..- attr(*, "filename")= chr "anonymized.ff"
.. .. ..- attr(*, "pagesize")= int 65536
.. .. ..- attr(*, "finalizer")= chr "delete"
.. .. ..- attr(*, "finonexit")= logi TRUE
.. .. ..- attr(*, "readonly")= logi FALSE
.. .. ..- attr(*, "caching")= chr "mmnoflush"
.. ..- attr(*, "virtual")= list()
.. .. ..- attr(*, "Length")= int 500000
.. .. ..- attr(*, "Symmetric")= logi FALSE
.. .. - attr(*, "class") = chr [1:2] "ff_vector" "ff"
.. $ site_id : list()
…
.. .. - attr(*, "class") = chr [1:2] "ff_vector" "ff"
.. $ site_domain : list()
…
.. .. - attr(*, "class") = chr [1:2] "ff_vector" "ff"
.. $ site_category : list()
…
.. .. - attr(*, "class") = chr [1:2] "ff_vector" "ff"
.. $ app_id : list()
…
.. .. - attr(*, "class") = chr [1:2] "ff_vector" "ff"
.. $ app_domain : list()
…
.. .. - attr(*, "class") = chr [1:2] "ff_vector" "ff"
.. $ app_category : list()
…
.. .. - attr(*, "class") = chr [1:2] "ff_vector" "ff"
.. $ device_id : list()
…
.. .. - attr(*, "class") = chr [1:2] "ff_vector" "ff"
.. $ device_ip : list()
….
.. .. - attr(*, "class") = chr [1:2] "ff_vector" "ff"
.. $ device_os : list()
…
.. .. - attr(*, "class") = chr [1:2] "ff_vector" "ff"
.. $ device_make : list()
…
.. .. - attr(*, "class") = chr [1:2] "ff_vector" "ff"
.. $ device_model : list()
…
.. .. - attr(*, "class") = chr [1:2] "ff_vector" "ff"
.. $ device_type : list()
…
.. .. - attr(*, "class") = chr [1:2] "ff_vector" "ff"
.. $ device_conn_type : list()
…
.. .. - attr(*, "class") = chr [1:2] "ff_vector" "ff"
.. $ device_geo_country: list()
…
.. .. - attr(*, "class") = chr [1:2] "ff_vector" "ff"
.. $ C17 : list()
…
.. .. - attr(*, "class") = chr [1:2] "ff_vector" "ff"
$ row.names: NULL
- attributes: List of 2
.. $ names: chr [1:3] "virtual" "physical" "row.names"
.. $ class: chr "ffdf"
(2/3) Logistic regression on the training dataset:
The goal is to learn/predict the "click" outcome from the "banner_pos" input.
`logreg1 <- glm(click ~ banner_pos, data = train.random.sample, family = "binomial")
summary(logreg1)` >
Call:
glm(formula = click ~ banner_pos, family = "binomial", data = train.random.sample)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.0555 -0.6495 -0.5951 -0.5951 1.9071
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.641416 0.004702 -349.12 <2e-16 ***
banner_pos 0.192534 0.007595 25.35 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 458848 on 499999 degrees of freedom
Residual deviance: 458215 on 499998 degrees of freedom
AIC: 458219
Number of Fisher Scoring iterations: 4
`class(logreg1)`>
[1] "glm" "lm"
(3/3) Test dataset: an ffdf object (created with the ff package).
`class(df.test)` >
[1] "ffdf"
The test dataset has the same structure as the training dataset, with about 4.8 million rows:
`str(df.test)`>
List of 3
$ virtual: 'data.frame': 26 obs. of 7 variables:
.. $ VirtualVmode : chr "integer" "integer" "integer" "integer" ...
.. $ AsIs : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
.. $ VirtualIsMatrix : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
.. $ PhysicalIsMatrix : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
.. $ PhysicalElementNo: int 1 2 3 4 5 6 7 8 9 10 ...
.. $ PhysicalFirstCol : int 1 1 1 1 1 1 1 1 1 1 ...
.. $ PhysicalLastCol : int 1 1 1 1 1 1 1 1 1 1 ...
.. - attr(*, "Dim")= int 4769401 26
.. - attr(*, "Dimorder")= int 1 2
$ physical: List of 26
…
I cannot manage to predict the click outcome. I first tried to create a data frame or ffdf object containing the banner_pos variable:
`modeldata <- df.test[["banner_pos"]]`
Then I tried to predict the outcome:
`predict.glm(object = logreg1, newdata = modeldata, type = "response")`
Error in as.data.frame.default(data) :
cannot coerce class "c("ff_vector", "ff")" to a data.frame
Is there something wrong with my code? Should I use a different function, or rely on another package (e.g. biglm)?
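One direction that might work is to predict chunk by chunk, so that each piece of the test set is a regular data.frame when it reaches predict(). The following is only a sketch under assumptions: `train`, `fit`, `test_ffdf` and `pred` are hypothetical small stand-ins for `train.random.sample`, `logreg1`, `df.test` and the desired output, and the toy model is fitted on an in-memory data.frame.

```r
library(ff)

# Hypothetical in-memory training data and model (stand-ins for the real ones)
set.seed(1)
train <- data.frame(banner_pos = sample(0:1, 1000, replace = TRUE))
train$click <- rbinom(1000, 1, plogis(-1.6 + 0.2 * train$banner_pos))
fit <- glm(click ~ banner_pos, data = train, family = "binomial")

# Hypothetical test set stored as an ffdf
test_ffdf <- as.ffdf(data.frame(banner_pos = sample(0:1, 5000, replace = TRUE)))

# ff vector on disk that receives one predicted probability per test row
pred <- ff(vmode = "double", length = nrow(test_ffdf))

# chunk() splits the rows into ranges sized to fit in RAM
for (rows in chunk(test_ffdf)) {
  # Reading a row range with a column name yields an ordinary data.frame
  block <- test_ffdf[rows, "banner_pos", drop = FALSE]
  # The predict() generic dispatches to predict.glm for a "glm" object
  pred[rows] <- unname(predict(fit, newdata = block, type = "response"))
}
```

The predictions stay on disk in `pred`, and `pred[]` loads them into memory if they fit. Whether predict.bigglm() is needed instead would depend on fitting the model with bigglm() in the first place, which this sketch does not do.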
Thank you very much for your thoughts on this question,
Best regards