0

我有一个如下数据框:

region        group        mid_pop   
  1             2            1146      
  2             4            1682        
  3             3            2891       
  4             1            7654       
  5             1            3289       
  6             2            1128       
  7             3            2121       
  8             4            3217      
  9             3            1616       
 10             1            1717              

我进行了多项回归并得到属于每个组的概率如下:

 mlogit <- multinom(group ~ mid_pop)
 probs <- predict(mlogit, type="probs")  

   probs1    probs2   probs3    probs4     
     0.2       0.3     0.4       0.1        
     0.3       0.4     0.15      0.15       
     0.4       0.1     0.3       0.2        
     0.7       0.1     0.1       0.1        
     0.2       0.3     0.4       0.1        
     0.6       0.1     0.1       0.2        
     0.7       0.1     0.1       0.1        
     0.3       0.2     0.1       0.4        
     0.2       0.1     0.1       0.6        
     0.1       0.2     0.1       0.6   

然后我为每个区域创建了一个权重。权重是“属于第一组的概率除以属于该区域所在的当前组的概率”。然后将权重乘以 mid_pop。

region        group        mid_pop    weight      mid_pop(weighted)
  1             2            1146      0.66           756.36
  2             4            1682       2              3364
  3             3            2891       2              5782
  4             1            7654       0.7            5357.8
  5             1            3289       0.2            657.8
  6             2            1128       0.3            338.4
  7             3            2121       0.7            1484.7
  8             4            3217       0.75           2412.75
  9             3            1616       0.33           533.28
 10             1            1717       0.16           274.72     

现在我想为组做一个标准化的平均差,看看加权前后 mid_pop 的平均值之间的差异。结果将是这样的:

SDM (group 1 vs. group 2)=....
SDM (group 1 vs. group 3)=....
SDM (group 1 vs. group 4)= ....

任何人都可以帮助我们做到这一点?提前致谢。

4

1 回答 1

1

图书馆 group_by的使用tidyverse

library(tidyverse)

df <-
  tibble(
    region = 1:10,
    group = c(2, 4, 3, 1, 1, 2, 3, 4, 3, 1),
    mid_pop = c(1146, 1682, 2891, 7654, 3289, 1128, 2121, 3217, 1616, 1717)
  ) # your data set

weight <- c(.66, 2, 2, .7, .2, .3, .7, .75, .33, .16)

df_wt <-
  df %>%
  bind_cols(weight = weight) %>%
  mutate(weighted = mid_pop * weight) %>% # your second data set: mid_pop(weighted)
  group_by(group) %>%
  summarise(pop = mean(weighted)) # average

## > df_wt
## # A tibble: 4 x 2
##   group   pop
##   <dbl> <dbl>
## 1     1 2097.
## 2     2  547.
## 3     3 2600.
## 4     4 2888.

outer带有"-"操作的函数可能会产生成对的差异

wt_pop <- df_wt %>% select(pop) %>% pull()

outer(wt_pop, wt_pop, "-") # symmetric matrix for the answer

##            [,1]     [,2]       [,3]       [,4]
## [1,]     0.0000 1549.393  -503.2200  -791.6017
## [2,] -1549.3933    0.000 -2052.6133 -2340.9950
## [3,]   503.2200 2052.613     0.0000  -288.3817
## [4,]   791.6017 2340.995   288.3817     0.0000

或者,您可以连续申请outer
您需要将其更改为具有以下功能的数据框as.data.frame()

df %>%
  bind_cols(weight = weight) %>%
  mutate(weighted = mid_pop * weight) %>%
  group_by(group) %>%
  summarise(pop = mean(weighted)) %>%
  do(outer(.$pop, .$pop, "-") %>% as_tibble())

## # A tibble: 4 x 4
##       V1    V2     V3     V4
##    <dbl> <dbl>  <dbl>  <dbl>
## 1     0  1549.  -503.  -792.
## 2 -1549.    0  -2053. -2341.
## 3   503. 2053.     0   -288.
## 4   792. 2341.   288.     0 
于 2018-10-13T01:55:07.750 回答