sql - Greatest-n-per-group reference with intervals in R or SQL

Question

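My (non-trivial) problem is described below. This is my first post, now in a modified version. Any input or proposed solution would be helpful.

This has several dimensions: identifying the best solution to a small-scale problem (several suggestions are already below), run time (the data.table solution below seems to tick that box) and memory management. The problem is about tags that are enumerated in one table and represented by a cluster in another (tags belong to the same cluster if they lie within 30 bp of each other on the same strand).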

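The challenge boils down to identifying an efficient procedure for allocating a given tag to the appropriate interval. We are dealing with genomic data, meaning that tag coordinates are determined by a start position, an end position (= start position + 1), a chromosome (25 values in the full data set) and a strand (the position is on either the plus or the minus strand of double-stranded DNA). Thus clusters do not overlap on the same strand, but cluster coordinates may overlap if their intervals are on different strands, which complicates things.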
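To make the clustering rule concrete, here is a minimal base R sketch of how tags on one chromosome and strand could be grouped by gaps. The positions are made up, and the exact rule (a new cluster starts when the gap to the previous tag exceeds 30 bp) is an assumption based on the description above, not the pipeline that produced the real clusters table.

```r
# Hypothetical tag start positions on a single chromosome and strand, sorted
start <- c(100, 105, 120, 200, 210, 400)

# Assumed rule: a new cluster begins wherever the gap to the previous tag exceeds 30 bp
cluster_id <- cumsum(c(TRUE, diff(start) > 30))

# Cluster boundaries: smallest start and largest end (= start + 1) per cluster
start_clst <- as.numeric(tapply(start, cluster_id, min))
end_clst   <- as.numeric(tapply(start + 1, cluster_id, max))

toy_clusters <- data.frame(cluster_id = unique(cluster_id), start_clst, end_clst)
print(toy_clusters)
```

Running this per chromosome and strand keeps clusters on different strands independent, which is why their coordinates may overlap.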

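This is a modified version of my post from 9 January that better captures the inherent difficulty of the problem. A fast solution that solves the small-scale problem is shown later. If anyone would like to work on the full data set, please let me know. Many thanks in advance!

Regards,

Nick Clark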

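Background: The issue involves intervals and greatest-n-per-group. I have two tables containing clustered gene coordinates (clusters) and the input data (tags). The clusters table contains the summed tags from each covered, non-overlapping interval on the same strand in the tags table. The full clusters table has 1.6 million rows. The tags table has around 4 million rows, so the solution should ideally be vectorized. See below for some sample data. The setup is a data set of human transcription initiation sites (CAGE).

As I am working in R, I am looking for a solution based on R or SQL. I have made previous unsuccessful attempts with the plyr and sqldf packages in R.

The challenge: What I am missing is, for each row of the clusters table, the start coordinate from the input data table (tags) that is associated with the largest tag contribution.

Note that 1) clusters from different strands can have overlapping coordinates, 2) chr / chr_clst can take 25 different values (not shown in the example), and 3) a solution needs to account for both strand and chr / chr_clst.

My ideal solution: Vectorized R code or an improvement of the SQL statement below. A version of the solution below that copes with the memory issues would solve the problem, as would an improved SQL statement that efficiently determines the appropriate row from the clusters table.

Status so far: Here is the apparently best solution so far. Hat tip and cool points to user1935457 for the code and to mnel for the subsequent suggested modification. The snag is that moving from the toy example to the full-scale tables crashes R (and RStudio) due to excessive memory demands.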

library(data.table)

# Convert sample data provided in question
clusters <- as.data.table(clusters)
tags <- as.data.table(tags)

# Rename chr and strand for easier joining
setnames(clusters, c("chr_clst", "strand_clst"), c("chr", "strand"))

# Set key on each table for next step
setkey(clusters, chr, strand)
setkey(tags, chr, strand)

# Merge on the keys
tmp <- merge(clusters, tags, by = c("chr", "strand"))

# Find index (in merged table, tmp) of largest tag_count in each
# group subject to start_clst <= end <= end_clst
idx <- tmp[between(end, start_clst, end_clst),
       list(IDX=.I[which.max(tag_count)]),
       by=list(chr, start_clst,end_clst,strand)]$IDX

# Get those rows from merged table
tmp[idx]

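I initially created a basic SQL query using the sqldf package in R (this version finds the maximum value rather than the coordinate associated with the maximum). The query takes forever to run, despite (hopefully) appropriate indexes on both tables.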

output_tablename <- sqldf(c(
"create index ai1 on clusters(chr_clst, start_clst, end_clst, strand_clst)",
"create index ai2 on tags(chr, start, end, strand)",
"select a.chr_clst, a.start_clst, a.end_clst, a.strand_clst, sum(b.tags)
from main.clusters a
inner join main.tags b on a.chr_clst=b.chr and a.strand_clst = b.strand 
and b.end between a.start_clst and a.end_clst
group by a.chr_clst, a.start_clst, a.end_clst, a.strand_clst
order by a.chr_clst, a.start_clst, a.end_clst, a.strand_clst"
))

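Table structure
clusters: chr_clst, start_clst, end_clst, strand_clst, tags_clst.
tags: chr, start, end, strand, tag_count.

Sample data in R format. If anyone would like to work on the full data set, please let me know.

clusters: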

chr_clst <- c("chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1")
start_clst <- c(568911, 569233, 569454, 569793, 569877, 569926, 569972, 570048, 570166, 713987)
end_clst <- c(568941, 569256, 569484, 569803, 569926, 569952, 569973, 570095, 570167, 714049)
strand_clst <- c("+", "+", "+", "+", "+", "-", "+", "+", "+", "-")
tags_clst <- c(37, 4, 6, 3, 80, 25, 1, 4, 1, 46)

clusters <- data.frame(cbind(chr_clst, start_clst, end_clst, strand_clst, tags_clst))
clusters$start_clst <- as.numeric(as.character(clusters$start_clst))
clusters$end_clst <- as.numeric(as.character(clusters$end_clst))
clusters$tags_clst <- as.numeric(as.character(clusters$tags_clst))
rm(chr_clst, start_clst, end_clst, strand_clst, tags_clst)

tags:

chr <- c("chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1")

start <- c(568911, 568912, 568913, 568913, 568914, 568916, 568917, 568918, 568929, 
568929, 568932, 568933, 568935, 568937, 568939, 568940, 568940, 569233, 569247, 
569255, 569454, 569469, 569471, 569475, 569483, 569793, 569802, 569877, 569880, 
569887, 569889, 569890, 569891, 569893, 569894, 569895, 569895, 569896, 569897, 
569898, 569898, 569899, 569900, 569900, 569901, 569901, 569903, 569905, 569906, 
569907, 569907, 569908, 569908, 569909, 569910, 569910, 569911, 569911, 569912, 
569914, 569914, 569915, 569916, 569917, 569918, 569919, 569920, 569920, 569925, 
569926, 569936, 569938, 569939, 569939, 569940, 569941, 569941, 569942, 569942, 
569943, 569944, 569948, 569948, 569951, 569972, 570048, 570057, 570078, 570094, 
570166, 713987, 713989, 713995, 714001, 714001, 714007, 714008, 714010, 714011, 
714011, 714011, 714013, 714015, 714015, 714017, 714018, 714019, 714023, 714025, 
714029, 714034, 714034, 714037, 714038, 714039, 714039, 714040, 714042, 714048, 
714048)

end <- c(568912, 568913, 568914, 568914, 568915, 568917, 568918, 568919, 568930, 
568930, 568933, 568934, 568936, 568938, 568940, 568941, 568941, 569234, 569248,
569256, 569455, 569470, 569472, 569476, 569484, 569794, 569803, 569878, 569881, 
569888, 569890, 569891, 569892, 569894, 569895, 569896, 569896, 569897, 569898, 
569899, 569899, 569900, 569901, 569901, 569902, 569902, 569904, 569906, 569907, 
569908, 569908, 569909, 569909, 569910, 569911, 569911, 569912, 569912, 569913, 
569915, 569915, 569916, 569917, 569918, 569919, 569920, 569921, 569921, 569926, 
569927, 569937, 569939, 569940, 569940, 569941, 569942, 569942, 569943, 569943, 
569944, 569945, 569949, 569949, 569952, 569973, 570049, 570058, 570079, 570095, 
570167, 713988, 713990, 713996, 714002, 714002, 714008, 714009, 714011, 714012, 
714012, 714012, 714014, 714016, 714016, 714018, 714019, 714020, 714024, 714026, 
714030, 714035, 714035, 714038, 714039, 714040, 714040, 714041, 714043, 714049, 
714049)

strand <- c("+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", 
"+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", 
"+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", 
"+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", 
"+", "+", "+", "+", "+", "+", "+", "+", "-", "-", "-", "-", "-", "-", "-", "-", 
"-", "-", "-", "-", "-", "-", "-", "+", "+", "+", "+", "+", "+", "-", "-", "-", 
"-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", 
"-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-")

tag_count <- c(1, 1, 1, 2, 3, 2, 3, 1, 1, 1, 1, 1, 2, 1, 6, 2, 8, 1, 1, 2, 1, 1, 2, 
1, 1, 2, 1, 1, 1, 1, 1, 2, 2, 4, 4, 1, 1, 1, 1, 1, 3, 2, 1, 1, 2, 4, 2, 4, 2, 4, 
1, 1, 1, 1, 3, 2, 1, 3, 1, 2, 3, 1, 1, 3, 2, 1, 1, 1, 5, 1, 2, 1, 2, 1, 1, 2, 2, 
4, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 3, 2, 4, 2, 1, 1, 1, 
2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 4, 1, 2)

tags <- data.frame(cbind(chr, start, end, strand, tag_count))    
tags$start <- as.numeric(as.character(tags$start))
tags$end <- as.numeric(as.character(tags$end))
tags$tag_count <- as.numeric(as.character(tags$tag_count))
rm(chr, start, end, strand, tag_count)

score 2 · Accepted Answer

Here is an attempt using the data.table package:

library(data.table)

# Convert sample data provided in question
clusters <- as.data.table(clusters)
tags <- as.data.table(tags)

# Rename chr and strand for easier joining
setnames(clusters, c("chr_clst", "strand_clst"), c("chr", "strand"))

# Set key on each table for next step
setkey(clusters, chr, strand)
setkey(tags, chr, strand)

# Merge on the keys
tmp <- merge(clusters, tags, by = c("chr", "strand"))

# Find index (in merged table, tmp) of largest tag_count in each
# group subject to start_clst <= end <= end_clst
idx <- tmp[between(end, start_clst, end_clst),
           list(IDX=.I[which.max(tag_count)]),
           by=list(chr, start_clst,end_clst,strand)]$IDX

# Get those rows from merged table
tmp[idx]

Output of the last line:

     chr strand start_clst end_clst tags_clst  start    end tag_count
 1: chr1      -     569926   569952        25 569942 569943         4
 2: chr1      -     713987   714049        46 714011 714012         4
 3: chr1      +     568911   568941        37 568940 568941         8
 4: chr1      +     569233   569256         4 569255 569256         2
 5: chr1      +     569454   569484         6 569471 569472         2
 6: chr1      +     569793   569803         3 569793 569794         2
 7: chr1      +     569877   569926        80 569925 569926         5
 8: chr1      +     569972   569973         1 569972 569973         1
 9: chr1      +     570048   570095         4 570048 570049         1
10: chr1      +     570166   570167         1 570166 570167         1

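Edit

Based on the memory issues discussed in the comments below, here is another attempt. I use the intervals package to find overlapping intervals between the two tables. You could also explore parallelizing the for loop to gain speed.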

require(data.table)
require(intervals)
clusters <- data.table(clusters)
tags <- data.table(tags)

#  Find all unique combinations of chr and strand...
setkey(clusters, chr_clst, strand_clst)
setkey(tags, chr, strand)

unique.keys <- unique(rbind(clusters[, key(clusters), with=FALSE],
                            tags[, key(tags), with=FALSE], use.names=FALSE))

# ... and then work on each pair individually to avoid creating
# enormous objects in memory
result.list <- vector("list", nrow(unique.keys))
for(i in seq_len(nrow(unique.keys))) {
  tmp.clst <- clusters[unique.keys[i]]
  tmp.tags <- tags[unique.keys[i]]

  # Keep track of each row for later
  tmp.clst[, row.id := seq_len(nrow(tmp.clst))]
  tmp.tags[, row.id := seq_len(nrow(tmp.tags))]

  # Use intervals package to find all overlapping [start, end] 
  # intervals between the two tables
  clst.intervals <- Intervals(tmp.clst[, list(start_clst, end_clst)],
                              type = "Z")
  tags.intervals <- Intervals(tmp.tags[, list(start, end)],
                              type = "Z")
  rownames(tags.intervals) <- tmp.tags$row.id

  # This goes to C++ code in intervals package; 
  # I didn't spend too much time looking over how it works
  overlaps <- interval_overlap(tags.intervals,
                               clst.intervals,
                               check_valid = FALSE)

  # Retrieve rows from clusters table with overlaps and add a column
  # indicating which intervals in tags table they overlapped with
  matches <- lapply(as.integer(names(overlaps)), function(n) {
    ans <- tmp.clst[overlaps[[n]]]
    ans[, match.in.tags := n]
  })

  # List back to one table...
  matches <- rbindlist(matches)

  # ... and join each match from tags to its relevant row from tags
  setkey(matches, match.in.tags)
  setkey(tmp.tags, row.id)

  # add the rows for max of tag_count by start_clst and
  # end_clst from this particular unique key to master list...
  result.list[[i]] <- tmp.tags[matches][, .SD[which.max(tag_count)],
                                        by = list(start_clst, end_clst)]
}

# and concatenate master list into one table,
# getting rid of the helper columns
rbindlist(result.list)[, c("row.id", "row.id.1") := NULL][]

The last line gives:

    start_clst end_clst  chr strand  start    end tag_count chr_clst strand_clst tags_clst
 1:     569926   569952 chr1      - 569942 569943         4     chr1           -        25
 2:     713987   714049 chr1      - 714011 714012         4     chr1           -        46
 3:     568911   568941 chr1      + 568940 568941         8     chr1           +        37
 4:     569233   569256 chr1      + 569255 569256         2     chr1           +         4
 5:     569454   569484 chr1      + 569471 569472         2     chr1           +         6
 6:     569793   569803 chr1      + 569793 569794         2     chr1           +         3
 7:     569877   569926 chr1      + 569925 569926         5     chr1           +        80
 8:     569972   569973 chr1      + 569972 569973         1     chr1           +         1
 9:     570048   570095 chr1      + 570048 570049         1     chr1           +         4
10:     570166   570167 chr1      + 570166 570167         1     chr1           +         1

score 2 · Accepted Answer

Just a quick answer to add some hints, building on the other answers and comments.

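If X[Y] (or merge(X,Y)) returns a large number of rows, larger than max(nrow(X),nrow(Y)) (such as nrow(X)*nrow(Y)), then X[Y][where] (i.e. X[Y] followed by a subset) isn't going to help. The end result is a lot smaller, but the large X[Y] has to be created first.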
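The blow-up is easy to see on a toy example. The sketch below uses base R merge() on made-up data (the values are hypothetical; only the column names follow the question) and compares the size of the intermediate join with the size of the final result.

```r
# Hypothetical clusters (X) and tags (Y), all on the same chromosome and strand
X <- data.frame(chr = "chr1", strand = "+",
                start_clst = c(1, 101, 201), end_clst = c(50, 150, 250))
Y <- data.frame(chr = "chr1", strand = "+", end = c(10, 20, 120, 230))

# Joining on chr and strand alone pairs every cluster with every tag
joined <- merge(X, Y, by = c("chr", "strand"))
n_joined <- nrow(joined)   # nrow(X) * nrow(Y) = 12

# Only afterwards is the result cut down to the tags inside each interval
kept <- joined[joined$end >= joined$start_clst & joined$end <= joined$end_clst, ]
n_kept <- nrow(kept)       # 4

print(c(joined = n_joined, kept = n_kept))
```

As a rough estimate, if 1.6 million clusters and 4 million tags were spread evenly over 50 chromosome/strand groups, the intermediate table would have on the order of 10^11 rows, which explains the crash on the full data.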

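If ranges are needed, then one way is w = X[Y,roll=TRUE,which=TRUE] or w = X[Y,mult="first",which=TRUE] or similar, possibly twice: once for the first and again for the last. Once you have the row locations (w) of each range, you can seq or vecseq between the start and end, and then select the result. There are some examples in other SO questions in this tag. It would be good to build that into data.table, of course, and there is a feature request suggesting that a join column could itself be a 2-column list column, containing the boundaries of a range query per row per column.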

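Or, by-without-by can be used. That's where j is evaluated for each row of i when there is no by clause. Search for by-without-by in ?data.table and see the examples. That's how you can stick to the Cartesian-then-subset thinking without actually creating the entire Cartesian result first. Something like: X[Y,.SD[start<=i.value & i.value<=end]]. That's likely to be slower than the roll or mult approach, though, due to the 3 vector scans (&, <= and <=). But at least it would avoid the large memory allocation. Note that the prefix i. can be used to refer explicitly to the non-join columns of i. A between function in data.table could be coded in C to do this more efficiently, similar to the clamp function in Rcpp. But currently between() is written as R vector scans, so it is equally slow.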
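As a minimal sketch of the per-row evaluation described above: in data.table v1.9.4 and later, by-without-by has to be requested explicitly with by = .EACHI. The tables and the column names grp, value and id below are hypothetical, and the on= argument assumes data.table v1.9.6 or later.

```r
library(data.table)

# Hypothetical interval table X and lookup table Y
X <- data.table(grp = c("a", "a", "b"),
                start = c(1, 10, 1), end = c(5, 20, 5), id = 1:3)
Y <- data.table(grp = c("a", "a", "b"), value = c(3, 12, 9))

# For each row of Y, j sees only the rows of X with the same grp;
# i.value refers to the non-join column 'value' of Y
res <- X[Y, on = "grp",
         .(id = id[start <= i.value & i.value <= end]),
         by = .EACHI]
print(res)
```

The full join of X and Y is never materialized; only the matching rows of each group are scanned, one row of Y at a time.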

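Hope that helps. I've tried to explain the current thinking, right or wrong.

And we'll improve data.table to catch the Cartesian allocation with a graceful error, giving some of the hints mentioned in the comments [EDIT: allow.cartesian=FALSE is now added in v1.8.7]. Thanks!

To expand on the 2nd paragraph: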

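# Assumes clusters' chr_clst and strand_clst were renamed to chr and strand,
# as in the first answer. Key both tables with the interval end last.
setkey(clusters,chr,strand,end_clst)
setkey(tags,chr,strand,end)

# Row number in tags of the first tag whose end is at or after each cluster start ...
begin = tags[clusters[,list(chr,strand,start_clst)],roll=-Inf,mult="first",which=TRUE]
# ... and of the last tag whose end is at or before each cluster end
end = tags[clusters[,list(chr,strand,end_clst)],roll=+Inf,mult="last",which=TRUE]

# Within each row range, pick the row with the largest tag_count
idx = mapply(function(x,y){.i=seq.int(x,y); .i[ which.max(tags$tag_count[.i]) ]}, begin, end)
cbind(clusters, tags[idx])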
     chr start_clst end_clst strand tags_clst  chr  start    end strand tag_count
 1: chr1     569926   569952      -        25 chr1 569942 569943      -         4
 2: chr1     713987   714049      -        46 chr1 714011 714012      -         4
 3: chr1     568911   568941      +        37 chr1 568940 568941      +         8
 4: chr1     569233   569256      +         4 chr1 569255 569256      +         2
 5: chr1     569454   569484      +         6 chr1 569471 569472      +         2
 6: chr1     569793   569803      +         3 chr1 569793 569794      +         2
 7: chr1     569877   569926      +        80 chr1 569925 569926      +         5
 8: chr1     569972   569973      +         1 chr1 569972 569973      +         1
 9: chr1     570048   570095      +         4 chr1 570048 570049      +         1
10: chr1     570166   570167      +         1 chr1 570166 570167      +         1

This avoids the Cartesian memory allocation problem mentioned in the other answers and comments. It uses the following new features in v1.8.7:

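o In addition to TRUE/FALSE, roll may now be a positive number (roll forwards/LOCF) or a negative number (roll backwards/NOCB). A finite number limits the distance a value is rolled (limited staleness). roll=TRUE and roll=+Inf are equivalent.
o rollends is a new parameter holding two logicals. The first observation is rolled backwards if rollends[1] is TRUE. The last observation is rolled forwards if rollends[2] is TRUE. If roll is a finite number, the same limit applies to the ends.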
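The roll semantics quoted above can be tried on a toy table. This is only an illustrative sketch: the tables obs and query and their columns are hypothetical, and the on= argument assumes data.table v1.9.6 or later (with older versions, set a key on pos instead).

```r
library(data.table)

# Hypothetical observations: one value per position
obs <- data.table(pos = c(10, 20, 30), val = c("a", "b", "c"))
# Positions to look up; none of them matches exactly
query <- data.table(pos = c(5, 25, 35))

# roll = +Inf carries the previous observation forward (LOCF)
locf <- obs[query, on = "pos", roll = +Inf]
# roll = -Inf carries the next observation backward (NOCB)
nocb <- obs[query, on = "pos", roll = -Inf]

print(locf$val)   # NA  "b" "c"
print(nocb$val)   # "a" "c" NA
```

With the default rollends, a forward roll extends past the last observation but not before the first, and a backward roll does the opposite, which is why each result contains exactly one NA.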

score 1 · Accepted Answer

Here is a suggestion using apply:

transform(
  clusters,
  start = apply(clusters[c("chr_clst", "start_clst", "end_clst", "strand_clst")],
                1, function(x) {
                     tmp <- tags[tags$start >= as.numeric(x[2]) &
                                 tags$end <= as.numeric(x[3]) & 
                                 tags$chr == x[1] & 
                                 tags$strand == x[4], c("tag_count", "start")]
                     tmp$start[which.max(tmp$tag_count)]}))

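Basically, for each row of clusters, the function finds the maximum value of tag_count in the relevant subset of tags. The corresponding start value is selected. The vector of these values is used as the new column start of clusters.

Result: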

   chr_clst start_clst end_clst strand_clst tags_clst  start
1      chr1     568911   568941           +        37 568940
2      chr1     569233   569256           +         4 569255
3      chr1     569454   569484           +         6 569471
4      chr1     569793   569803           +         3 569793
5      chr1     569877   569926           +        80 569925
6      chr1     569926   569952           -        25 569942
7      chr1     569972   569973           +         1 569972
8      chr1     570048   570095           +         4 570048
9      chr1     570166   570167           +         1 570166
10     chr1     713987   714049           -        46 714011
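The apply call above scans all of tags once per cluster, which may be slow on millions of rows. Because clusters on the same chromosome and strand do not overlap, a vectorized alternative can be sketched in base R with findInterval(). This is a swapped-in technique, not part of the answer above; the function name max_tag_start and the toy values are hypothetical, and it uses the question's matching rule start_clst <= end <= end_clst.

```r
# Toy data with the question's column names (values are made up)
clusters <- data.frame(chr_clst = "chr1",
                       start_clst = c(100, 200, 150), end_clst = c(130, 240, 180),
                       strand_clst = c("+", "+", "-"), stringsAsFactors = FALSE)
tags <- data.frame(chr = "chr1",
                   start = c(100, 120, 205, 230, 150, 179),
                   end   = c(101, 121, 206, 231, 151, 180),
                   strand = c("+", "+", "+", "+", "-", "-"),
                   tag_count = c(2, 5, 1, 3, 4, 4), stringsAsFactors = FALSE)

max_tag_start <- function(clusters, tags) {
  clusters$start <- NA_real_
  cl_key <- paste(clusters$chr_clst, clusters$strand_clst)
  tg_key <- paste(tags$chr, tags$strand)
  for (k in unique(cl_key)) {               # one pass per chromosome/strand pair
    ci <- which(cl_key == k)
    ti <- which(tg_key == k)
    if (length(ti) == 0) next
    ci <- ci[order(clusters$start_clst[ci])] # findInterval needs sorted breaks
    # Candidate cluster per tag: the last cluster starting at or before the tag end
    pos <- findInterval(tags$end[ti], clusters$start_clst[ci])
    ok <- pos > 0
    ok[ok] <- tags$end[ti][ok] <= clusters$end_clst[ci][pos[ok]]
    ti <- ti[ok]; pos <- pos[ok]
    # Put the largest tag_count first within each cluster, then keep the first row
    o <- order(pos, -tags$tag_count[ti])
    ti <- ti[o]; pos <- pos[o]
    first <- !duplicated(pos)
    clusters$start[ci[pos[first]]] <- tags$start[ti[first]]
  }
  clusters
}

res <- max_tag_start(clusters, tags)
print(res)
```

The loop runs once per chromosome/strand pair (at most 50 times on the full data), and everything inside it is vectorized, so no Cartesian intermediate is built. Ties are resolved in favour of the earlier tag row, as which.max does.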

score 1 · Accepted Answer

This can be done very efficiently using the foverlaps() function from data.table v1.9.4:

require(data.table) #v1.9.4+
setDT(clusters, key=c("chr_clst", "strand_clst", "start_clst", "end_clst"))
setDT(tags, key=c("chr", "strand", "start", "end"))

ans = foverlaps(clusters, tags)[, .SD[which.max(tag_count)], by=.(chr_clst, strand_clst, start_clst, end_clst)]

#     chr_clst strand_clst start_clst end_clst  start    end tag_count tags_clst
#  1:     chr1           -     569926   569952 569942 569943         4        25
#  2:     chr1           -     713987   714049 714011 714012         4        46
#  3:     chr1           +     568911   568941 568940 568941         8        37
#  4:     chr1           +     569233   569256 569255 569256         2         4
#  5:     chr1           +     569454   569484 569471 569472         2         6
#  6:     chr1           +     569793   569803 569793 569794         2         3
#  7:     chr1           +     569877   569926 569925 569926         5        80
#  8:     chr1           +     569972   569973 569972 569973         1         1
#  9:     chr1           +     570048   570095 570048 570049         1         4
# 10:     chr1           +     570166   570167 570166 570167         1         1

foverlaps() will eventually also be able to perform overlapping range joins without having to set keys first, similar to the new on= argument (v1.9.6+).
