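I have a very simple logistic regression model based on only two categorical predictors, Race and Sex. Because I have some missing values, I first import the data frame with the following command so that all missing data comes in as NA: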
> mydata <- read.csv("~/Desktop/R/mydata.csv", sep=",", strip.white = TRUE,
+ na.strings= c("999", "NA", " ", ""))
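To illustrate what that import call does, here is a minimal sketch on a small inline CSV. The toy values and the names csv_text and toy_import are made up for illustration and are not taken from mydata.csv.

```r
# Hypothetical inline CSV standing in for mydata.csv
csv_text <- "Race,Sex,Support\n1,2,1\n999,1,0\n ,2,1\n2,,0\n"

# Same options as the import above: 999, "NA", blanks and empty fields become NA
toy_import <- read.csv(text = csv_text, sep = ",", strip.white = TRUE,
                       na.strings = c("999", "NA", " ", ""))

# Count missing values per column
colSums(is.na(toy_import))
```

Checking colSums(is.na(...)) right after the import is a quick way to confirm that every missing-value code was caught.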
Here is a summary of the predictors, to see how many NAs there are:
> # Define variables
>
> Y <- cbind(Support)
> X <- cbind(Race, Sex)
>
> summary(X)
      Race               Sex
 Min.   :1.000000   Min.   :1.000000
 1st Qu.:1.000000   1st Qu.:1.000000
 Median :2.000000   Median :1.000000
 Mean   :1.608696   Mean   :1.318245
 3rd Qu.:2.000000   3rd Qu.:2.000000
 Max.   :3.000000   Max.   :3.000000
 NA's   :420        NA's   :42
Even with the missing values, the model seems to do what it should without any problems:
> # Logit model coefficients
>
> logit <- glm(Y ~ X, family=binomial (link = "logit"))
>
> summary(logit)
Call:
glm(formula = Y ~ X, family = binomial(link = "logit"))

Deviance Residuals:
       Min          1Q      Median          3Q         Max
-2.0826825  -1.0911146   0.6473451   1.0190080   1.7457212

Coefficients:
              Estimate Std. Error  z value   Pr(>|z|)
(Intercept)  1.3457629  0.2884629  4.66529 3.0818e-06 ***
XRace       -1.0716191  0.1339177 -8.00207 1.2235e-15 ***
XSex         0.5910812  0.1420270  4.16175 3.1581e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1434.5361  on 1057  degrees of freedom
Residual deviance: 1347.5684  on 1055  degrees of freedom
  (420 observations deleted due to missingness)
AIC: 1353.5684

Number of Fisher Scoring iterations: 4
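Question 1: The following code works fine when I don't have any NAs, but whenever there are missing values I get an error message. Is there a way to still see how many values were predicted correctly, regardless of missing data?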
> table(true = Y, pred = round(fitted(logit)))
Error in table(true = Y, pred = round(fitted(logit))) :
all arguments must have the same length
Edit: After adding na.action = na.exclude to the model definition, the table now works perfectly:
    pred
true   0   1
   0 259 178
   1 208 413
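For reference, here is a minimal sketch of why na.exclude helps, using a made-up toy data frame (toy, fit_omit and fit_excl are illustrative names, and the values are random rather than taken from mydata). With na.omit, fitted() only returns values for the complete rows, so it is shorter than the response. With na.exclude, it is padded with NA back to the original length.

```r
# Toy data (hypothetical values, not the original mydata)
set.seed(1)
toy <- data.frame(
  Support = rbinom(20, 1, 0.5),
  Race    = sample(1:3, 20, replace = TRUE),
  Sex     = sample(1:2, 20, replace = TRUE)
)
toy$Race[c(3, 7)] <- NA   # introduce two missing predictor values

# na.omit: incomplete rows are dropped, and fitted() is shorter than the data
fit_omit <- glm(Support ~ Race + Sex, data = toy,
                family = binomial(link = "logit"), na.action = na.omit)
length(fitted(fit_omit))   # 18

# na.exclude: same fit, but fitted() is padded with NA to match the data
fit_excl <- glm(Support ~ Race + Sex, data = toy,
                family = binomial(link = "logit"), na.action = na.exclude)
length(fitted(fit_excl))   # 20

# The two vectors now have equal length; table() skips the NA pairs by default
table(true = toy$Support, pred = round(fitted(fit_excl)))
```

The coefficients are the same in both fits. Only the length of the fitted values and residuals changes.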
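When I use the following code, I can still load the predictions into the original data frame regardless of missing data. It correctly adds a "pred" column at the end of the data frame with the probability for each row (and simply puts an NA there if one of the predictors is missing):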
> predictions = cbind(mydata, pred = predict(logit, newdata = mydata, type = "response"))
> write.csv(predictions, "~/Desktop/R/predictions.csv", row.names = F)
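Question 2: However, when I try to predict on a new data frame, even one that has the same variables of interest, something about the missing values seems to cause an error message. Is there code that gets around this, or am I doing something wrong?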
> newpredictions = cbind(newdata, pred = predict(logit, newdata = newdata, type = "response"))
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 1475, 1478
In addition: Warning message:
'newdata' had 1475 rows but variables found have 1478 rows
As shown above, mydata has 1,478 rows and newdata has 1,475 rows.
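One possible explanation, offered as a sketch rather than a confirmed diagnosis: the model was fitted as Y ~ X on matrices built with cbind(), so predict() looks for a variable called X, does not find it in newdata, and falls back to the X in the workspace, which has 1,478 rows. That would match the warning about "variables found". The sketch below assumes the model can instead be fitted with a formula that names the columns and a data argument. The data frames train_df and new_df are made-up stand-ins for mydata and newdata.

```r
# Toy stand-ins for mydata and newdata (hypothetical values)
set.seed(2)
train_df <- data.frame(
  Support = rbinom(30, 1, 0.5),
  Race    = sample(1:3, 30, replace = TRUE),
  Sex     = sample(1:2, 30, replace = TRUE)
)
new_df <- data.frame(
  Race = sample(1:3, 27, replace = TRUE),
  Sex  = sample(1:2, 27, replace = TRUE)
)
new_df$Sex[5] <- NA   # a missing predictor in the new data

# Name the columns in the formula so predict() can find them in newdata
fit_named <- glm(Support ~ Race + Sex, data = train_df,
                 family = binomial(link = "logit"), na.action = na.exclude)

# One prediction per row of new_df; rows with a missing predictor get NA
new_pred <- predict(fit_named, newdata = new_df, type = "response")
length(new_pred)   # 27

scored <- cbind(new_df, pred = new_pred)
```

Here Race and Sex are treated as numeric codes, as in the original fit. Wrapping them in factor() would be a separate modeling choice.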
Thanks for your help!