r - Efficiently merging two data frames on non-trivial criteria

Question

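While answering this question last night, I spent a good hour trying to find a solution that didn't grow a data.frame in a for loop, without any success, so I'm curious whether there is a better way to approach this problem.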

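The general case of the problem boils down to:

Merge two data.frames.
Entries in either data.frame can have 0 or more matching entries in the other.
We only care about entries that have 1 or more matches across both.
The match function is complex, involving multiple columns in both data.frames.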

For a concrete example, I will use data similar to that in the linked question:

genes <- data.frame(gene       = letters[1:5], 
                    chromosome = c(2,1,2,1,3),
                    start      = c(100, 100, 500, 350, 321),
                    end        = c(200, 200, 600, 400, 567))
markers <- data.frame(marker = 1:10,
                   chromosome = c(1, 1, 2, 2, 1, 3, 4, 3, 1, 2),
                   position   = c(105, 300, 96, 206, 150, 400, 25, 300, 120, 700))

And our complex matching function:

# matching criteria, applies to a single entry from each data.frame
isMatch <- function(marker, gene) {
  return(
    marker$chromosome == gene$chromosome & 
    marker$position >= (gene$start - 10) &
    marker$position <= (gene$end + 10)
  )
}

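The output should look like an SQL INNER JOIN of the two data.frames, keeping the pairs of entries for which isMatch is TRUE. I've tried to construct the two data.frames so that an entry can have 0 or more matches in the other data.frame.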

The solution I came up with is as follows:

joined <- data.frame()
for (i in 1:nrow(genes)) {
   # This repeated subsetting returns the same results as `isMatch` applied across
   # the `markers` data.frame for each entry in `genes`.
   matches <- markers[which(markers$chromosome == genes[i, "chromosome"]),]
   matches <- matches[which(matches$position >= (genes[i, "start"] - 10)),]
   matches <- matches[which(matches$position <= (genes[i, "end"] + 10)),]
   # matches may now be 0 or more rows, which we want to repeat the gene for:
   if(nrow(matches) != 0) {
     joined <- rbind(joined, cbind(genes[i,], matches[,c("marker", "position")]))
   }
}

Giving the result:

   gene chromosome start end marker position
1     a          2   100 200      3       96
2     a          2   100 200      4      206
3     b          1   100 200      1      105
4     b          1   100 200      5      150
5     b          1   100 200      9      120
51    e          3   321 567      6      400

This is a very ugly and clunky solution, but everything else I tried failed:

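Using apply gave me a list in which each element was a matrix, with no way to rbind them.
I can't specify the dimensions of joined up front, because I don't know how many rows I will need in the end.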
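For illustration, both obstacles can be sidestepped by collecting matching row indices in a list and binding them once at the end, so nothing grows inside the loop and the final size need not be known in advance. This is a minimal sketch assuming the genes and markers data above; the names idx_list and idx are hypothetical, introduced only for this example.

```r
genes <- data.frame(gene       = letters[1:5],
                    chromosome = c(2, 1, 2, 1, 3),
                    start      = c(100, 100, 500, 350, 321),
                    end        = c(200, 200, 600, 400, 567))
markers <- data.frame(marker     = 1:10,
                      chromosome = c(1, 1, 2, 2, 1, 3, 4, 3, 1, 2),
                      position   = c(105, 300, 96, 206, 150, 400, 25, 300, 120, 700))

# For each gene, record the (gene row, marker row) index pairs that match.
idx_list <- lapply(seq_len(nrow(genes)), function(i) {
  hit <- which(markers$chromosome == genes$chromosome[i] &
               markers$position >= genes$start[i] - 10 &
               markers$position <= genes$end[i] + 10)
  # A gene with no matches contributes a zero-row data.frame.
  data.frame(g = rep(i, length(hit)), m = hit)
})

# Bind all index pairs once, then subset both data.frames in a single step.
idx <- do.call(rbind, idx_list)
joined <- cbind(genes[idx$g, ], markers[idx$m, c("marker", "position")])
rownames(joined) <- NULL
joined
```

The match is still evaluated once per gene, so this only removes the repeated rbind; it does not change the overall amount of comparison work.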

I'm sure I will run into problems of this general form again in the future. So what is the correct way to solve this kind of problem?

score 4 · Accepted Answer

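A data.table solution: a rolling join to fulfill the first inequality, followed by a vector scan to satisfy the second inequality. The join on the first inequality will have more rows than the final result (and may therefore run into memory issues), but it will be smaller than the straight-up merge in the merge-based answer.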

require(data.table)

genes_start <- as.data.table(genes)
## create the start bound as a separate column to join to
genes_start[,`:=`(start_bound = start - 10)]
setkey(genes_start, chromosome, start_bound)

markers <- as.data.table(markers)
setkey(markers, chromosome, position)

new <- genes_start[
    ##join genes to markers
    markers, 
    ##rolling the last key column of genes_start (start_bound) forward
    ##to match the last key column of markers (position)
    roll = Inf, 
    ##inner join
    nomatch = 0
##rolling join leaves positions column from markers
##with the column name from genes_start (start_bound)
##now vector scan to fulfill the other criterion
][start_bound <= end + 10]
##change names and column order to match desired result in question
setnames(new,"start_bound","position")
setcolorder(new,c("chromosome","gene","start","end","marker","position"))
#    chromosome gene start end marker position
# 1:          1    b   100 200      1      105
# 2:          1    b   100 200      9      120
# 3:          1    b   100 200      5      150
# 4:          2    a   100 200      3       96
# 5:          2    a   100 200      4      206
# 6:          3    e   321 567      6      400

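A double join is possible, but since it involves re-keying the data table before the second join, I don't think it would be faster than the vector scan solution above.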

##makes a copy of the genes object and keys it by end
genes_end <- as.data.table(genes)
genes_end[,`:=`(end_bound = end + 10, start = NULL, end = NULL)]
setkey(genes_end, chromosome, gene, end_bound)

## as before, wrapped in a similar join (but rolling backwards this time)
new_2 <- genes_end[
    setkey(
        genes_start[
        markers, 
        roll = Inf, 
        nomatch = 0
    ], chromosome, gene, start_bound), 
    roll = -Inf, 
    nomatch = 0
]
setnames(new_2, "end_bound", "position")
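For comparison, more recent data.table releases (1.9.8 and later, as far as I know) support non-equi joins, which can express both inequalities in a single join instead of a rolling join followed by a scan. The following is a minimal sketch and not part of the original answer; genes_dt, markers_dt, lo and hi are names introduced purely for illustration.

```r
library(data.table)

genes_dt <- data.table(gene       = letters[1:5],
                       chromosome = c(2, 1, 2, 1, 3),
                       start      = c(100, 100, 500, 350, 321),
                       end        = c(200, 200, 600, 400, 567))
markers_dt <- data.table(marker     = 1:10,
                         chromosome = c(1, 1, 2, 2, 1, 3, 4, 3, 1, 2),
                         position   = c(105, 300, 96, 206, 150, 400, 25, 300, 120, 700))

# Precompute the padded bounds (assumed helper columns, not in the original data).
genes_dt[, `:=`(lo = start - 10, hi = end + 10)]

# Non-equi inner join: same chromosome and lo <= position <= hi.
# In the result the join columns carry the bound values from genes_dt,
# so x.position is used to recover the marker's actual position.
joined_dt <- markers_dt[genes_dt,
                        on = .(chromosome, position >= lo, position <= hi),
                        nomatch = 0L,
                        .(gene, chromosome, start, end, marker,
                          position = x.position)]
joined_dt
```

Whether this is faster than the rolling join plus vector scan on large inputs would need to be measured on the real data.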

score 4 · Accepted Answer

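I dealt with a very similar problem myself by doing the merge first and then sorting out which rows satisfy the condition. I don't claim that this is a universal solution; if you are working with large datasets in which few entries match the condition, it will likely be inefficient. But to adapt it to your data: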

joined.raw <- merge(genes, markers)
joined <- joined.raw[joined.raw$position >= (joined.raw$start -10) & joined.raw$position <= (joined.raw$end + 10),]
joined
#    chromosome gene start end marker position
# 1           1    b   100 200      1      105
# 2           1    b   100 200      5      150
# 4           1    b   100 200      9      120
# 10          2    a   100 200      4      206
# 11          2    a   100 200      3       96
# 16          3    e   321 567      6      400

score 2 · Accepted Answer

Another answer I came up with, using the sqldf package.

library(sqldf)

sqldf("SELECT gene, genes.chromosome, start, end, marker, position 
       FROM genes JOIN markers ON genes.chromosome = markers.chromosome 
       WHERE position >= (start - 10) AND position <= (end + 10)")

According to microbenchmark, it performs comparably to @alexwhan's merge and [ method.

> microbenchmark(alexwhan, sql)
Unit: nanoseconds
     expr min    lq median  uq  max neval
 alexwhan 435 462.5  468.0 485 2398   100
      sql 422 456.5  466.5 498 1262   100

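I also tried testing both functions on some real data in the same format as the data I work with (35,000 rows for genes, 2,000,000 rows for markers, with the joined output coming to 480,000 rows).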

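Unfortunately, merge seems unable to handle this much data: joined.raw <- merge(genes, markers) fails with the following error (which I don't get if I reduce the number of rows):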

Error in merge.data.frame(genes, markers) : 
  negative length vectors are not allowed

The sqldf method, by contrast, runs successfully in 29 minutes.

score 0 · Accepted Answer

Nearly a year after you solved this problem for me... I have now spent some time handling it another way, using awk...

awk 'FNR==NR{a[NR]=$0;next}{for (i in a){split(a[i],x," ");if (x[2]==$2 && x[3]-10 <=$3 && x[4]+10 >=$3)print x[1],x[2],x[3],x[4],$0}}' gene.txt makers.txt > genesnp.txt

This produces the same result:

b   1   100 200 1   1   105
a   2   100 200 3   2   96
a   2   100 200 4   2   206
b   1   100 200 5   1   150
e   3   321 567 6   3   400
b   1   100 200 9   1   120
