r - Merge rows in a data frame that have similar (but not equal) values

Question

I have a df like:

   SampleID Chr Start End    Strand  Value
1:   rep1     1 11001 12000     -     10
2:   rep1     1 15000 20100     -     5
3:   rep2     1 11070 12050     -     1
4:   rep3     1 14950 20090     +     20
...

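I would like to merge rows that share the same Chr and Strand and have similar Start and End values (say, within a distance of +/- 100). For the rows that get merged, I would also like to concatenate the SampleID names and the Value entries. For the example above, the result would be something like: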

   SampleID Chr Start End    Strand  Value
1:rep1,rep2   1 11001 12000     -     10,1
2:   rep1     1 15000 20100     -     5
4:   rep3     1 14950 20090     +     20
...

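Any ideas? Thanks!

EDIT:

I have found the fuzzyjoin package for R (https://cran.r-project.org/web/packages/fuzzyjoin/index.html). Does anyone have experience with this package?

EDIT2:

It would also be fine if only one of the variables (SampleID or Value) were concatenated.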

score 1 · Accepted Answer

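We can group by 'Chr' and 'Strand', order the rows by 'Start' and 'End', and create a grouping ID ('ind') from the differences between adjacent elements of the 'Start' and 'End' columns. Then, grouping by 'Chr', 'Strand' and 'ind', we take the first element of 'Start' and 'End' while pasting together the elements of the 'SampleID' and 'Value' columns.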

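library(data.table)
# Within each Chr/Strand group, visit rows in Start/End order and open a new
# cluster ('ind') whenever Start or End differs from the previous row by 100 or more.
# cumsum() over the "gap" flags gives every gap its own id; rleid() on the logical
# condition would lump consecutive far-apart rows into a single run.
df[order(Start, End), ind := cumsum(!(abs(Start - shift(Start, fill = Start[1])) < 100 &
     abs(End - shift(End, fill = End[1])) < 100)) + 1L, by = .(Chr, Strand)
   ][, .(Start = Start[1], End = End[1],   # keep the first row's coordinates
     SampleID = toString(SampleID), Value = toString(Value)), .(Strand, Chr, ind)]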
#     Strand Chr ind Start   End   SampleID Value
#1:      -   1   1 11001 12000 rep1, rep2 10, 1
#2:      -   1   2 15000 20100       rep1     5
#3:      +   1   1 14950 20090       rep3    20
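To see how a grouping ID can be derived from the differences between adjacent elements, here is a minimal base R sketch on a single sorted vector. The `starts` values are hypothetical and only for illustration; the threshold of 100 comes from the question.

```r
# Hypothetical, already sorted start positions (not taken from the question)
starts <- c(11001, 11070, 15000, 15050, 30000)

# Distance from each element to the previous one; the first element has no
# predecessor, so its gap is set to 0
gap <- c(0, diff(starts))

# A new cluster begins whenever the gap reaches the threshold of 100
cluster <- cumsum(gap >= 100) + 1L
cluster
```

Note that this is a chained comparison: each row is compared only with the row before it, so a long run of closely spaced rows can end up in one cluster even if its first and last members are more than 100 apart.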

Note: this assumes 'df' is a data.table.
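If 'df' is a plain data.frame, it can be converted first. The sketch below rebuilds the four sample rows from the question (the column types are an assumption, since the question only shows printed output) and converts them with `setDT()`.

```r
library(data.table)

# Sample rows copied from the question; column types are assumed
df <- data.frame(
  SampleID = c("rep1", "rep1", "rep2", "rep3"),
  Chr      = c(1L, 1L, 1L, 1L),
  Start    = c(11001L, 15000L, 11070L, 14950L),
  End      = c(12000L, 20100L, 12050L, 20090L),
  Strand   = c("-", "-", "-", "+"),
  Value    = c(10, 5, 1, 20)
)

# setDT() converts the data.frame to a data.table by reference (no copy is made)
setDT(df)
is.data.table(df)
```

`as.data.table(df)` would also work if a separate copy is preferred over modifying 'df' in place.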
