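I have a large dataset on which I am fitting a linear regression model with some qualitative predictors. I call the dataset WN, and the qualitative variables are OState and DState (US states). As shown below, OState and DState in WN each have 62 levels: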
> unique(WN$OState)
[1] NY MA PA DE DC VA MD WV NC RI SC NH GA FL AL TN MS ME KY OH IN MI VT IA WI MN SD ND MT CT IL MO KS NE NJ LA AR OK TX CO WY ID UT AZ NM NV CA OR WA
62 Levels: AA AE AK AL AP AR AS AZ CA CO CT DC DE FL FM GA GU HI IA ID IL IN KS KY LA MA MD ME MH MI MN MO MP MS MT NC ND NE NH NJ NM NV NY OH OK OR PA PR PW RI SC SD TN TX UT VA VI VT WA ... WY
> unique(WN$DState)
[1] MA RI NH ME VT CT NY NJ PA DE DC VA MD WV NC SC GA FL AL TN MS KY OH IN MI IA WI MN SD ND MT IL MO KS NE LA AR OK TX CO WY ID UT AZ NM NV CA OR WA
62 Levels: AA AE AK AL AP AR AS AZ CA CO CT DC DE FL FM GA GU HI IA ID IL IN KS KY LA MA MD ME MH MI MN MO MP MS MT NC ND NE NH NJ NM NV NY OH OK OR PA PR PW RI SC SD TN TX UT VA VI VT WA ... WY
I then fit a regression model that predicts Rate from Distance, OState and DState, as follows:
> WN.LR = lm(WN$Rate~WN$Distance+WN$OState+WN$DState)
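When I inspect the regression summary, coefficients are reported for only 48 of the OState and DState levels; the other 14 are missing. A small part of the summary output is shown below. For example, OStateAL does not appear in the output: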
> summary(WN.LR)
Call:
lm(formula = WN$Rate ~ WN$Distance + WN$OState + WN$DState)

Residuals:
    Min      1Q  Median      3Q     Max
-2370.3  -218.4   -18.9   170.8  9105.7

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.208e+03  6.632e+00 182.171  < 2e-16 ***
WN$Distance  1.626e+00  3.111e-03 522.722  < 2e-16 ***
WN$OStateAR  2.000e+02  7.294e+00  27.419  < 2e-16 ***
WN$OStateAZ  1.981e+02  8.372e+00  23.667  < 2e-16 ***
WN$OStateCA  1.056e+02  7.919e+00  13.340  < 2e-16 ***
WN$OStateCO  1.323e+02  7.332e+00  18.043  < 2e-16 ***
WN$OStateCT -2.019e+02  1.827e+01 -11.048  < 2e-16 ***
WN$OStateDC  5.711e+02  2.178e+01  26.223  < 2e-16 ***
On the other hand, when I look at the rows with OState = "AL", I see that there are more than 6000 of them:
> WNnew<-subset(WN,OState=="AL")
> nrow(WNnew)
[1] 6213
What is the explanation for this?
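A minimal sketch that may help narrow this down, run on made-up data rather than on WN. The data frame `toy` and its values are hypothetical. It compares the number of levels a factor declares with the number that actually occur in the rows, and then lists which coefficients `lm` reports. It assumes R's default treatment contrasts are in effect.

```r
# Toy data (hypothetical, not the WN dataset): a factor that declares
# more levels than actually occur in the rows.
set.seed(1)
toy <- data.frame(
  Distance = runif(60, 100, 2000),
  OState   = factor(rep(c("AL", "AR", "AZ"), each = 20),
                    levels = c("AA", "AK", "AL", "AR", "AZ"))
)
toy$Rate <- 1200 + 1.6 * toy$Distance + rnorm(60, sd = 50)

# Declared levels versus levels that are actually present in the data.
nlevels(toy$OState)               # 5
nlevels(droplevels(toy$OState))   # 3

# lm() drops unused levels when it builds the model frame. Under the
# default treatment contrasts, the first remaining level ("AL" here)
# is the baseline and gets no coefficient of its own.
fit <- lm(Rate ~ Distance + OState, data = toy)
names(coef(fit))
# "(Intercept)" "Distance" "OStateAR" "OStateAZ"
```

In this toy case "AL" has 20 rows yet no OStateAL term, because its effect is absorbed into the intercept. If the same mechanism applies to WN, it would be consistent with the output above: unique(WN$OState) lists 49 distinct states, which is 48 coefficients plus one baseline level, and the remaining declared levels never occur in the data.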