r - R: sparse? Transforming data for a co-occurrence matrix

Question

I'm a biology student using R to generate some visualizations that show which human proteins (uniprots) are targeted by different bacterial strains.

# sample data
human.uniprots <- c("P15311", "P0CG48", "Q8WYH8", "P42224", "Q9NXR8",
                    "P40763", "P05067", "P60709", "Q9UDW1", "Q9H160",
                    "Q9UKL0", "P26038", "P61244", "O95817", "Q09472",
                    "P15311","P05067", "P60709", "Q9UDW1", "Q9H160")
strains <- rep(c("A", "B", "C", "D"), each = 5)
final <- cbind(human.uniprots, strains)

I'm trying to generate a co-occurrence matrix/heatmap... something like

h.map <- data.frame(matrix(nrow = length(unique(human.uniprots)),
                           ncol = length(unique(strains)) + 1))
# one label column plus one column per distinct strain
h.map.cols <- c("human_uniprots", unique(strains))
colnames(h.map) <- h.map.cols

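...where the columns are strains, the rows are proteins, and the data frame cells are filled with the number of strains the protein interacts with. So if strains A, B and C all interact with a uniprot, their cells in that uniprot's row should each have the value 3.

I've tried making a list of tuples of unique strains and human_uniprots, then searching the matrix I want to populate for tuples that match a strain and human uniprot pair, adding a "1" on a match... but I'm not sure how to work with tuples in R. Then I saw this: Populating a co-occurrence matrix

This is what I want, but I don't understand the usage or the syntax... is sparse() even a function in R?

Also... it would be great to rank the proteins by how many strains they interact with. So proteins that interact with all strains should be at the top, followed by proteins that interact with 2 strains, then 1 strain...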

score 1 · Accepted Answer

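Use dplyr's group_by and count together with tidyr's spread to get the count for each strain. Then replace each nonzero strain count with the total count for that row, computed with rowSums():

library(dplyr)
library(tidyr)  # spread() lives in tidyr, not dplyr

as.data.frame(final) %>%
  group_by(human.uniprots, strains) %>%
  count() %>%
  spread(strains, n, fill = 0) %>%  # fill = 0 so absent pairs are 0, not NA
  ungroup() %>%
  mutate(total_n = rowSums(.[2:ncol(.)])) %>%
  mutate_if(is.numeric, ~ ifelse(. == 0, 0, total_n)) %>%
  select(-total_n)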

  # A tibble: 15 x 5
   human.uniprots     A     B     C     D
   <fct>          <dbl> <dbl> <dbl> <dbl>
 1 O95817            0.    0.    1.    0.
 2 P05067            0.    2.    0.    2.
 3 P0CG48            1.    0.    0.    0.
 4 P15311            2.    0.    0.    2.
 5 P26038            0.    0.    1.    0.
 6 P40763            0.    1.    0.    0.
 7 P42224            1.    0.    0.    0.
 8 P60709            0.    2.    0.    2.
 9 P61244            0.    0.    1.    0.
10 Q09472            0.    0.    1.    0.
11 Q8WYH8            1.    0.    0.    0.
12 Q9H160            0.    2.    0.    2.
13 Q9NXR8            1.    0.    0.    0.
14 Q9UDW1            0.    2.    0.    2.
15 Q9UKL0            0.    0.    1.    0.
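
spread() and the scoped mutate_if() have since been superseded in tidyr and dplyr. Below is a minimal sketch of the same idea with count(), pivot_wider() and across(), assuming dplyr >= 1.0 and tidyr >= 1.1 are installed. The names final_df and wide are illustrative only, and the final arrange() step is an addition that also ranks proteins by the number of strains they interact with.

```r
library(dplyr)
library(tidyr)

# Sample data from the question, with four strains A-D
human.uniprots <- c("P15311", "P0CG48", "Q8WYH8", "P42224", "Q9NXR8",
                    "P40763", "P05067", "P60709", "Q9UDW1", "Q9H160",
                    "Q9UKL0", "P26038", "P61244", "O95817", "Q09472",
                    "P15311", "P05067", "P60709", "Q9UDW1", "Q9H160")
strains <- rep(c("A", "B", "C", "D"), each = 5)
final_df <- data.frame(human.uniprots, strains)  # illustrative name

wide <- final_df %>%
  count(human.uniprots, strains) %>%                 # one row per protein/strain pair
  pivot_wider(names_from = strains, values_from = n,
              values_fill = 0, names_sort = TRUE) %>% # absent pairs become 0
  mutate(total_n = rowSums(across(A:D))) %>%          # strains per protein
  mutate(across(A:D, ~ ifelse(.x == 0, 0, total_n))) %>%
  arrange(desc(total_n), human.uniprots)              # most-targeted proteins first

print(wide)
```

Keeping total_n in the result makes the ranking explicit; drop it with select(-total_n) if only the strain columns are needed.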

score 1 · Accepted Answer

You can do this with table, or, if you want the result to be sparse, with xtabs.

So for your example you could use

tab <- table(final[,"human.uniprots"], final[,"strains"])
tab * rowSums(tab)

or, sparse,

tab <- xtabs(~human.uniprots + strains, final, sparse=TRUE)
tab <- tab*Matrix::rowSums(tab)
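
To see why multiplying by the row sums gives the requested values, here is a small base R sketch. It assumes each protein/strain pair occurs at most once, so the table holds only 0s and 1s; n_strains and scaled are illustrative names. R recycles the vector down the columns, so row i is multiplied by the i-th row sum.

```r
# Sample data from the question, with four strains A-D
human.uniprots <- c("P15311", "P0CG48", "Q8WYH8", "P42224", "Q9NXR8",
                    "P40763", "P05067", "P60709", "Q9UDW1", "Q9H160",
                    "Q9UKL0", "P26038", "P61244", "O95817", "Q09472",
                    "P15311", "P05067", "P60709", "Q9UDW1", "Q9H160")
strains <- rep(c("A", "B", "C", "D"), each = 5)

tab <- table(human.uniprots, strains)  # 15 x 4 table of 0/1 counts
n_strains <- rowSums(tab)              # number of strains hitting each protein

# The length-15 vector is recycled column by column, so every entry in
# row i is multiplied by n_strains[i]; zeros stay zero.
scaled <- tab * n_strains

scaled["P15311", ]  # hit by A and D, so both cells become 2
scaled["P0CG48", ]  # hit by A only, so the A cell stays 1
```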

Then you can plot it with

Matrix::image(tab, scales=list(y=list(at=1:nrow(tab), label=rownames(tab)),
                               x=list(at=1:ncol(tab), label=colnames(tab))),
              ylab="uniprots",
              xlab="strains")

You can also rank the rows by number of occurrences

r <- order(-Matrix::rowSums(tab))

# and then reorder the rows of the matrix and the row labels
Matrix::image(tab[r,],
              scales=list(y=list(at=1:nrow(tab), label=rownames(tab)[r]),
                          x=list(at=1:ncol(tab), label=colnames(tab))),
              ylab="uniprots",
              xlab="strains")
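
The ranking step can also be checked without plotting. The following base R sketch counts the strains per protein and sorts the rows accordingly; hits and ranked are illustrative names, and breaking ties alphabetically is an assumption made only so that the order is deterministic.

```r
# Sample data from the question, with four strains A-D
human.uniprots <- c("P15311", "P0CG48", "Q8WYH8", "P42224", "Q9NXR8",
                    "P40763", "P05067", "P60709", "Q9UDW1", "Q9H160",
                    "Q9UKL0", "P26038", "P61244", "O95817", "Q09472",
                    "P15311", "P05067", "P60709", "Q9UDW1", "Q9H160")
strains <- rep(c("A", "B", "C", "D"), each = 5)

tab <- table(human.uniprots, strains)

# Number of distinct strains that target each protein
hits <- rowSums(tab > 0)

# Most-targeted proteins first; ties broken by protein ID
ranked <- tab[order(-hits, rownames(tab)), ]

head(ranked)
```

Using tab > 0 rather than tab counts each strain once even if a protein/strain pair were listed more than once.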

score 1 · Accepted Answer

sparse() is, by the looks of it, a MATLAB function. What you are describing is a bipartite network represented by an incidence matrix.

human.uniprots <- c("P15311", "P0CG48", "Q8WYH8", "P42224", "Q9NXR8",
                    "P40763", "P05067", "P60709", "Q9UDW1", "Q9H160",
                    "Q9UKL0", "P26038", "P61244", "O95817", "Q09472",
                    "P15311","P05067", "P60709", "Q9UDW1", "Q9H160")
strains <- rep(c("A", "B", "C", "D"), each = 5)
final <- cbind(human.uniprots, strains)

final_df <- as.data.frame(final)

library(igraph) # install.packages("igraph")
g <- graph_from_data_frame(final_df, directed = FALSE)
V(g)$type <- ifelse(V(g)$name %in% strains, FALSE, TRUE)

as_incidence_matrix(g)
#>   P15311 P0CG48 Q8WYH8 P42224 Q9NXR8 P40763 P05067 P60709 Q9UDW1 Q9H160
#> A      1      1      1      1      1      0      0      0      0      0
#> B      0      0      0      0      0      1      1      1      1      1
#> C      0      0      0      0      0      0      0      0      0      0
#> D      1      0      0      0      0      0      1      1      1      1
#>   Q9UKL0 P26038 P61244 O95817 Q09472
#> A      0      0      0      0      0
#> B      0      0      0      0      0
#> C      1      1      1      1      1
#> D      0      0      0      0      0

Or...

# swap TRUE/FALSE so that the strains become the columns
V(g)$type <- ifelse(V(g)$name %in% strains, TRUE, FALSE)

as_incidence_matrix(g)
#>        A B C D
#> P15311 1 0 0 1
#> P0CG48 1 0 0 0
#> Q8WYH8 1 0 0 0
#> P42224 1 0 0 0
#> Q9NXR8 1 0 0 0
#> P40763 0 1 0 0
#> P05067 0 1 0 1
#> P60709 0 1 0 1
#> Q9UDW1 0 1 0 1
#> Q9H160 0 1 0 1
#> Q9UKL0 0 0 1 0
#> P26038 0 0 1 0
#> P61244 0 0 1 0
#> O95817 0 0 1 0
#> Q09472 0 0 1 0

Created on 2018-05-25 by the reprex package (v0.2.0).
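
On whether R has something like MATLAB's sparse(): one close counterpart is sparseMatrix() from the Matrix package, which builds a sparse matrix from row indices, column indices and values. The sketch below is an illustration under that assumption rather than part of the reprex above; prot, strn and m are illustrative names.

```r
library(Matrix)

# Sample data from the question, with four strains A-D
human.uniprots <- c("P15311", "P0CG48", "Q8WYH8", "P42224", "Q9NXR8",
                    "P40763", "P05067", "P60709", "Q9UDW1", "Q9H160",
                    "Q9UKL0", "P26038", "P61244", "O95817", "Q09472",
                    "P15311", "P05067", "P60709", "Q9UDW1", "Q9H160")
strains <- rep(c("A", "B", "C", "D"), each = 5)

# Factors give each protein and each strain an integer index
prot <- factor(human.uniprots)
strn <- factor(strains)

# One entry of value 1 per observed protein/strain pair
m <- sparseMatrix(i = as.integer(prot),
                  j = as.integer(strn),
                  x = rep(1, length(prot)),
                  dims = c(nlevels(prot), nlevels(strn)),
                  dimnames = list(levels(prot), levels(strn)))

m                   # proteins in rows, strains in columns
Matrix::rowSums(m)  # number of strains per protein
```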
