
I have two data frames, dfa and dfb:

dfa <- data.frame(
  gene_name = c("MUC16", "MUC2", "MET", "FAT1", "TERT"),
  id = c(1:5)
)

dfb <- data.frame(
  gene_name = c("MUC1", "MET; BLEP", "MUC21", "FAT", "TERT"),
  id = c(6:10)
)

They look like this:

> dfa
  gene_name id
1     MUC16  1
2      MUC2  2
3       MET  3
4      FAT1  4
5      TERT  5

> dfb
  gene_name id
1      MUC1  6
2 MET; BLEP  7
3     MUC21  8
4       FAT  9
5      TERT 10

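dfa is my list of genes of interest: I want to keep the rows of dfb in which they appear, minding the numbers (MUC1 is not MUC16). My new_df should look like this: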

> new_df
  gene_name id
1 MET; BLEP  7
2      TERT 10

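My problem is that the regular dplyr::semi_join() does exact matching, which doesn't take into account the fact that dfb$gene_name can contain several genes separated by "; ". This means that in this example, "MET" is not retained.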
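
To make the problem concrete, here is a minimal sketch using only the two data frames above (stringsAsFactors = FALSE is added so the comparison is on plain strings, and the name exact is just for illustration). It shows that the exact-match join keeps the TERT row but drops the MET; BLEP row:

```r
library(dplyr)

dfa <- data.frame(
  gene_name = c("MUC16", "MUC2", "MET", "FAT1", "TERT"),
  id = 1:5,
  stringsAsFactors = FALSE
)

dfb <- data.frame(
  gene_name = c("MUC1", "MET; BLEP", "MUC21", "FAT", "TERT"),
  id = 6:10,
  stringsAsFactors = FALSE
)

# semi_join() compares whole strings, so "MET; BLEP" never equals "MET"
exact <- semi_join(dfb, dfa, by = "gene_name")
exact
```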

I tried to look into fuzzyjoin::regex_semi_join, but I can't get it to do what I want...

A tidyverse solution would be welcome. (Maybe with stringr?!)

EDIT: follow-up question...

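How would I go about doing the reciprocal anti_join? Simply changing semi_join to anti_join in this approach doesn't work, because the MET; BLEP row is present when it shouldn't be...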

Adding a filter(gene_name == new_col) after the anti_join works with the simple dataset provided, but not if I twist it a bit like this:

dfa <- data.frame(
  gene_name = c("MUC16", "MUC2", "MET", "FAT1", "TERT"),
  id = c(1:5)
)

dfb <- data.frame(
  gene_name = c("MUC1", "MET; BLEP", "MUC21; BLOUB", "FAT", "TERT"),
  id = c(6:10)
)

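...then it doesn't work anymore. Here and in my real dataset, dfa doesn't contain semicolons; it is just one column of single gene names. But dfb contains a lot of information, with many combinations of semicolon-separated genes...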


3 Answers


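You can use separate_rows() to split the data frame before joining. Note that if BLEP were also present in dfa, this would produce duplicate rows, which is why distinct() is used: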

dfa <- data.frame(
  gene_name = c("MUC16", "MUC2", "MET", "FAT1", "TERT"),
  id = c(1:5),
  stringsAsFactors = FALSE
)

dfb <- data.frame(
  gene_name = c("MUC1", "MET; BLEP", "MUC21", "FAT", "TERT"),
  id = c(6:10),
  stringsAsFactors = FALSE
)


library(tidyverse)

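dfb %>%
  # keep the original string and split a copy into one row per gene
  mutate(new_col = gene_name) %>%
  separate_rows(new_col, sep = "; ") %>%
  # keep the rows whose single gene appears in dfa
  semi_join(dfa, by = c("new_col" = "gene_name")) %>%
  select(gene_name, id) %>%
  # a row matching several genes would otherwise appear more than once
  distinct()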
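
For the follow-up anti-join, the same splitting idea can be reused. What follows is only a sketch, and it assumes that id uniquely identifies each row of dfb (true in the example data, not guaranteed in general). The names matched and unmatched are made up for illustration. The idea is to collect the ids of the rows that match at least one gene, then remove those rows from the original, unsplit dfb:

```r
library(dplyr)
library(tidyr)

dfa <- data.frame(
  gene_name = c("MUC16", "MUC2", "MET", "FAT1", "TERT"),
  id = 1:5,
  stringsAsFactors = FALSE
)

# the "twisted" dfb from the follow-up question
dfb <- data.frame(
  gene_name = c("MUC1", "MET; BLEP", "MUC21; BLOUB", "FAT", "TERT"),
  id = 6:10,
  stringsAsFactors = FALSE
)

# ids of the dfb rows in which at least one gene is listed in dfa
# (assumption: id is unique per row of dfb)
matched <- dfb %>%
  separate_rows(gene_name, sep = "; ") %>%
  semi_join(dfa, by = "gene_name") %>%
  distinct(id)

# drop those rows from the original, unsplit data frame
unmatched <- anti_join(dfb, matched, by = "id")
unmatched
```

Joining back on id rather than on the split gene names is what keeps MET; BLEP out of the result: the row is excluded as a whole as soon as one of its genes is in dfa.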


answered 2019-11-13T15:10:59.277

Here is a solution using stringr and purrr:

library(tidyverse)

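dfb %>%
  # split each entry into a character vector of gene names
  mutate(gene_name_list = str_split(gene_name, "; ")) %>%
  # TRUE if at least one of those genes is listed in dfa
  mutate(gene_of_interest = map_lgl(gene_name_list, some, ~ . %in% dfa$gene_name)) %>%
  filter(gene_of_interest) %>%
  select(gene_name, id)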
answered 2019-11-13T16:24:01.833

I think I finally managed to make the fuzzyjoin regex joins do what I wanted. It was quite simple; I just had to tweak my dfa filter list:

library(fuzzyjoin)

# wrap each gene of the filter list in "\\b" (regex word boundary)
# so that only whole words are matched
dfa$gene_name <- paste0("\\b", dfa$gene_name, "\\b")

# to keep genes from dfb that are present in the dfa filter list
dfb %>% 
  regex_semi_join(dfa, by = c(gene_name = "gene_name"))

# to exclude genes from dfb that are present in the dfa filter blacklist
dfb %>% 
  regex_anti_join(dfa, by = c(gene_name = "gene_name"))

One drawback: it is quite slow...
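
If speed matters, one alternative worth trying is to skip the join and test each row of dfb against a single combined pattern with stringr::str_detect(). This is only a sketch and has not been benchmarked here. It assumes the gene names contain no regex metacharacters, and it starts from the original dfa (without the "\\b" added above). The names pattern, kept and dropped are made up for illustration:

```r
library(dplyr)
library(stringr)

dfa <- data.frame(
  gene_name = c("MUC16", "MUC2", "MET", "FAT1", "TERT"),
  id = 1:5,
  stringsAsFactors = FALSE
)

dfb <- data.frame(
  gene_name = c("MUC1", "MET; BLEP", "MUC21; BLOUB", "FAT", "TERT"),
  id = 6:10,
  stringsAsFactors = FALSE
)

# one alternation wrapped in word boundaries:
# \b(?:MUC16|MUC2|MET|FAT1|TERT)\b
# (assumption: gene names contain no regex metacharacters)
pattern <- paste0("\\b(?:", paste(dfa$gene_name, collapse = "|"), ")\\b")

# equivalent of the semi join: rows containing at least one listed gene
kept <- dfb %>% filter(str_detect(gene_name, pattern))

# equivalent of the anti join: rows containing none of them
dropped <- dfb %>% filter(!str_detect(gene_name, pattern))
```

Each string in dfb is checked once against one pattern, instead of being compared with every row of dfa, which is where I would expect any gain to come from.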

answered 2019-11-18T17:17:20.490