r - 在 R 中强制逻辑模型中的参考类别

Question

使用 R，我正在运行一个逻辑模型，并且需要以下列方式包含一个交互项，其中 A 是分类的，B 是连续的。

Y ~ A + B + normalized(B):A

我的问题是，当我这样做时，参考类别与

Y ~ A + B + A:B

这使得模型的比较变得困难。我确信有一种方法可以强制参考类别始终相同，但似乎无法找到一个直截了当的答案。

为了说明，我的数据如下所示：

income                      ndvi        sga
30,000$ - 49,999$        -0,141177617        0
30,000$ - 49,999$        -0,170513257        0
>80,000$                 -0,054939323        1
>80,000$                 -0,14724104         0
>80,000$                 -0,207678157        0
missing                  -0,229890869        1
50,000$ - 79,999$         0,245063253        0
50,000$ - 79,999$         0,127565529        0
15,000$ - 29,999$        -0,145778357        0
15,000$ - 29,999$        -0,170944338        0
30,000$ - 49,999$        -0,121060635        0
30,000$ - 49,999$        -0,245407291        0
missing                  -0,156427532        0
>80,000$                  0,033541238        0

输出如下所示。第一组结果是模型 Y ~ A*B 的形式，第二组是 Y ~ A + B + A:normalized(B)

                                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)                            -2.72175    0.29806  -9.132   <2e-16 ***
ndvi                                    2.78106    2.16531   1.284   0.1990    
income15,000$ - 29,999$                -0.53539    0.46211  -1.159   0.2466    
income30,000$ - 49,999$                -0.68254    0.39479  -1.729   0.0838 .  
income50,000$ - 79,999$                -0.13429    0.33097  -0.406   0.6849    
income>80,000$                         -0.56692    0.35144  -1.613   0.1067    
incomemissing                          -0.85257    0.47230  -1.805   0.0711 .  
ndvi:income15,000$ - 29,999$           -2.27703    3.25433  -0.700   0.4841    
ndvi:income30,000$ - 49,999$           -3.76892    2.86099  -1.317   0.1877    
ndvi:income50,000$ - 79,999$           -0.07278    2.46483  -0.030   0.9764    
ndvi:income>80,000$                    -3.32489    2.62000  -1.269   0.2044    
ndvi:incomemissing                     -3.98098    3.35447  -1.187   0.2353 

                                         Estimate Std. Error z value Pr(>|z|)    
(Intercept)                              -3.07421    0.30680 -10.020   <2e-16 ***
ndvi                                     -1.19992    2.56201  -0.468    0.640    
income15,000$ - 29,999$                  -0.33379    0.29920  -1.116    0.265    
income30,000$ - 49,999$                  -0.34885    0.26666  -1.308    0.191    
income50,000$ - 79,999$                  -0.12784    0.25124  -0.509    0.611    
income>80,000$                           -0.27255    0.27288  -0.999    0.318    
incomemissing                            -0.50010    0.31299  -1.598    0.110    
income<15,000$:normalize(ndvi)            0.40515    0.34139   1.187    0.235    
income15,000$ - 29,999$:normalize(ndvi)   0.17341    0.35933   0.483    0.629    
income30,000$ - 49,999$:normalize(ndvi)   0.02158    0.32280   0.067    0.947    
income50,000$ - 79,999$:normalize(ndvi)   0.39774    0.28697   1.386    0.166    
income>80,000$:normalize(ndvi)            0.06677    0.30087   0.222    0.824    
incomemissing:normalize(ndvi)                  NA         NA      NA       NA

所以在第一个模型中，类别“收入<15,000”是参考类别，而在第二个模型中，发生了一些不同的事情，我还不是很清楚。

score 0 · Accepted Answer

假设我们想对这个方程进行回归。

我们尝试使用model.matrix. 但是下面的结果说明了一些自动化问题。有没有更好的方法来实现它？. 更具体地说，假设 X_1 是一个连续变量，而 X_2 是一个虚拟变量。

基本上交互项的解释是相同的，除了主要项 X_2 将在 X_1 处于其平均值时进行评估。（见本文初稿）

这里有一些数据来说明我的观点：（这不是 glm 但我们可以将相同的方法应用于 glm）

library(car)
str(Prestige)
# some data cleaning
Prestige <- Prestige[!is.na(Prestige$type),] 

# interaction the usual way.
lm1 <- lm(income ~ education+ type + education:type, data = Prestige); summary(lm1)

# interacting with demeaned education
Prestige$education_ <- Prestige$education-mean(Prestige$education)

当使用正则公式方法时，事情并没有按照我们想要的方式发展。由于公式没有将任何变量作为参考

lm2 <- lm(income ~ education+ type + education_:type, data = Prestige); summary(lm2)

# Using model.matrix to shape the interaction
cusInt <- model.matrix(~-1+education_:type,data=Prestige)[,-1];colnames(cusInt)
lm3 <- lm(income ~ education+ type + cusInt, data = Prestige); summary(lm3)


compareCoefs(lm1,lm3,lm2)

结果在这里：

                         Est. 1  SE 1 Est. 2  SE 2 Est. 3  SE 3
(Intercept)                -1865  3682  -1865  3682   4280  8392
education                    866   436    866   436    297   770
typeprof                   -3068  7192   -542  1950   -542  1950
typewc                      3646  9274  -2498  1377  -2498  1377
education:typeprof           234   617                          
education:typewc            -569   885                          
cusInteducation_:typeprof                 234   617             
cusInteducation_:typewc                  -569   885             
typebc:education_                                      569   885
typeprof:education_                                    803   885
typewc:education_

所以基本上在使用 model.matrix 时，我们必须进行干预以设置参考变量。另外变量名前面还出现了一些custInt，所以当有很多表要比较时格式化结果是相当繁琐的。

r - 在 R 中强制逻辑模型中的参考类别

1 回答 1

Related

Reference