r - Grouped multicolumn gather with dplyr, tidyr, purrr

Question

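I'm trying to gather data for two different variables that are spread across multiple columns, grouped by two other variables. Here's the problem: I have several genes and several samples. Each sample has three possible genotypes, each with an associated frequency. I want to tidy this up to get single columns for gene, sample, genotype, and frequency.

I have a hackjob solution that involves creating list columns, gathering those, and then using purrr::map functions to extract the values. It's ugly, it doesn't really scale, and the frequencies are converted to character before being converted back to numeric, which is not ideal.

Is there a better way to go about this?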

library(tidyverse) 
# or, separately load dplyr, tibble, tidyr, purrr

# Here's what I have
# tibble() replaces the deprecated data_frame()
have <- tibble(gene=rep(c("gX", "gY"), each=2),
               sample=rep(c("s1", "s2"), 2),
               genotype1=c("AA", "AA", "GG", "GG"),
               genotype2=c("AC", "AC", "GT", "GT"),
               genotype3=c("CC", "CC", "TT", "TT"),
               freq1=c(.8,.9, .7, .6),
               freq2=c(.15,.1, .2, .35),
               freq3=c(.05,0, .1, .05))
have
#> # A tibble: 4 × 8
#>    gene sample genotype1 genotype2 genotype3 freq1 freq2 freq3
#>   <chr>  <chr>     <chr>     <chr>     <chr> <dbl> <dbl> <dbl>
#> 1    gX     s1        AA        AC        CC   0.8  0.15  0.05
#> 2    gX     s2        AA        AC        CC   0.9  0.10  0.00
#> 3    gY     s1        GG        GT        TT   0.7  0.20  0.10
#> 4    gY     s2        GG        GT        TT   0.6  0.35  0.05

# Here's what I want. 
# Do a multicolumn gather grouped by gene and sample
want <- have %>%
  group_by(gene, sample) %>%
  summarize(x1=list(c(genotype=genotype1, freq=freq1)),
            x2=list(c(genotype=genotype2, freq=freq2)),
            x3=list(c(genotype=genotype3, freq=freq3))) %>%
  ungroup() %>%
  gather(key, value, x1, x2, x3) %>%
  mutate(genotype=map_chr(value, "genotype"),
         freq=map_chr(value, "freq") %>% as.numeric) %>%
  select(-key, -value) %>%
  arrange(gene, sample, genotype)
want
#> # A tibble: 12 × 4
#>     gene sample genotype  freq
#>    <chr>  <chr>    <chr> <dbl>
#> 1     gX     s1       AA  0.80
#> 2     gX     s1       AC  0.15
#> 3     gX     s1       CC  0.05
#> 4     gX     s2       AA  0.90
#> 5     gX     s2       AC  0.10
#> 6     gX     s2       CC  0.00
#> 7     gY     s1       GG  0.70
#> 8     gY     s1       GT  0.20
#> 9     gY     s1       TT  0.10
#> 10    gY     s2       GG  0.60
#> 11    gY     s2       GT  0.35
#> 12    gY     s2       TT  0.05
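
The character round trip mentioned in the question comes from c(): an atomic vector holds a single type, so combining a genotype string with a frequency coerces the number to character. A minimal base R sketch of that behavior (the names pair and rec are only illustrative):

```r
# c() builds an atomic vector, so mixing character and numeric
# coerces every element to character.
pair <- c(genotype = "AA", freq = 0.8)
class(pair)
# The frequency now has to be converted back by hand.
freq_back <- as.numeric(pair[["freq"]])

# A list keeps each element's own type, so no conversion is needed.
rec <- list(genotype = "AA", freq = 0.8)
class(rec$freq)
```

This is why the workaround needs map_chr() followed by as.numeric: by the time the values reach the list column, freq is already a string.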

score 6 · Accepted Answer

You can use to_long() from the sjmisc package, which gathers multiple columns at once:

to_long(have, keys = "genos", values = c("genotype", "freq"),
       c("genotype1", "genotype2", "genotype3"),
       c("freq1", "freq2", "freq3"))

##  A tibble: 12 × 5
##     gene sample     genos genotype  freq
##    <chr>  <chr>     <chr>    <chr> <dbl>
## 1     gX     s1 genotype1       AA  0.80
## 2     gX     s2 genotype1       AA  0.90
## 3     gY     s1 genotype1       GG  0.70
## 4     gY     s2 genotype1       GG  0.60
## 5     gX     s1 genotype2       AC  0.15
## 6     gX     s2 genotype2       AC  0.10
## 7     gY     s1 genotype2       GT  0.20
## 8     gY     s2 genotype2       GT  0.35
## 9     gX     s1 genotype3       CC  0.05
## 10    gX     s2 genotype3       CC  0.00
## 11    gY     s1 genotype3       TT  0.10
## 12    gY     s2 genotype3       TT  0.05

to_long() takes the names of the key and value columns, followed by one vector of column names for each set of columns to be gathered.

score 1 · Accepted Answer

A full tidyverse approach:

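want <- have %>%
     # stack all genotype*/freq* columns; mixing types coerces value to character
     gather(variable, value, -gene, -sample) %>% 
     # split e.g. "genotype1" into its index (1) and its stem ("genotype")
     mutate(group = parse_number(variable),
            variable = str_extract(variable,"\\D+")) %>% 
     spread(variable, value) %>% 
     # freq came back from gather() as character, so make it numeric again
     mutate(freq = as.numeric(freq)) %>% 
     select(gene, sample, genotype, freq)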
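
As an alternative that is not part of either answer, newer versions of tidyr (1.0.0 and later) provide pivot_longer(), whose ".value" sentinel in names_to can reshape several sets of columns in one step. The sketch below assumes that version is installed and rebuilds the question's data; the helper column name set is only illustrative. Because the genotype and freq columns are never stacked into one column, freq should stay numeric throughout.

```r
library(tidyr)
library(tibble)

# Same data as in the question
have <- tibble(
  gene = rep(c("gX", "gY"), each = 2),
  sample = rep(c("s1", "s2"), 2),
  genotype1 = c("AA", "AA", "GG", "GG"),
  genotype2 = c("AC", "AC", "GT", "GT"),
  genotype3 = c("CC", "CC", "TT", "TT"),
  freq1 = c(.8, .9, .7, .6),
  freq2 = c(.15, .1, .2, .35),
  freq3 = c(.05, 0, .1, .05)
)

# names_pattern splits "genotype1" into "genotype" and "1".
# ".value" says the first piece names the output column;
# the second piece goes into the helper column "set".
want <- pivot_longer(
  have,
  cols = -c(gene, sample),
  names_to = c(".value", "set"),
  names_pattern = "(\\D+)(\\d+)"
)

# Drop the helper column, leaving gene, sample, genotype, freq
want$set <- NULL
want
```

The regular expression does the same job as parse_number() and str_extract() in the gather/spread chain, but the pairing of genotype1 with freq1 is handled by tidyr itself.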
