I have a huge dataframe df which includes information about overlapping intervals (A) and (B) and the chromosome (chrom) on which they are located. It also holds a value (level of gene expression) observed over interval (A).
chrom value Astart Aend Bstart Bend
chr1 0 0 54519752 17408 17431
chr1 0 0 54519752 17368 17391
chr1 0 0 54519752 567761 567783
chr11 0 2 93466832 568111 568133
chr11 0 2 93466832 568149 568171
chr11 0 2 93466832 1880734 1880756
chr11 4 93466844 93466880 93466856 93466878
chr11 2 93466885 135006516 93466889 93466911
chr11 2 93466885 135006516 94199710 94199732
Note that the same interval may appear several times, for instance, an interval (B) will have been reported two times if it overlapped with two (A) intervals:
Astart(1)=========================Aend(1) Astart(2)========================Aend(2)
                  Bstart(1)=======================================Bend(1)
chrom value Astart Aend Bstart Bend
chr1 0 0 25 15 35 #A(1) and B(1) overlap
chr1 1 28 45 15 35 #A(2) and B(1) overlap
Likewise, an interval (A) will have been reported two or more times if it overlapped with two or more (B) intervals:
Astart(3)===================================================================Aend(3)
Bstart(2)=========Bend(2) Bstart(3)===========Bend(3) Bstart(4)===============Bend(4)
chrom value Astart Aend Bstart Bend
chr4 0 10 100 15 25 #A(3) and B(2) overlap
chr4 0 10 100 30 75 #A(3) and B(3) overlap
chr4 3 10 100 80 120 #A(3) and B(4) overlap
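One consequence of this duplication is that every position of B(1) shows up once per overlapping (A) interval, each time with a different value. If only the part of (B) that actually lies inside (A) should carry the value of (A), the overlap can be computed per row first. This is a minimal sketch, not a definitive solution: it assumes that restricting to the overlap is wanted at all, and the names toy, ovl_start and ovl_end are made up for illustration.

```r
# Toy rows copied from the A(1)/A(2)/B(1) example above
toy <- data.frame(
  chrom  = c("chr1", "chr1"),
  value  = c(0, 1),
  Astart = c(0, 28),
  Aend   = c(25, 45),
  Bstart = c(15, 15),
  Bend   = c(35, 35)
)

# Row-wise overlap of (A) and (B): the later of the two starts
# and the earlier of the two ends (hypothetical column names)
toy$ovl_start <- pmax(toy$Astart, toy$Bstart)
toy$ovl_end   <- pmin(toy$Aend, toy$Bend)

toy[, c("chrom", "value", "ovl_start", "ovl_end")]
```

With these two rows, the first overlap runs from 15 to 25 (value 0) and the second from 28 to 35 (value 1), so no position would be assigned two different values.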
My goal is to output all the individual positions from intervals (B) and the corresponding values from (A). I have a piece of code that beautifully outputs all the relevant positions in (B):
position <- unlist(mapply(seq, ans$Bstart, ans$Bend - 1))
> head(position)
[1] 17408 17409 17410 17411 17412 17413
The problem is that this output alone is not enough to recover the chromosome information. I need to keep track of chromosome AND position together when I list these positions, because the same position integer may occur on several chromosomes. So I can't afterwards just run something like for position %in% range(Astart, Aend) output $chrom, $value (dummy code).
How can I retrieve (chrom, position, value) at the same time?
The expected result would be something like this:
> head(expected_result)
chrom position value
chr1 17408 0
chr1 17409 0
chr1 17410 0
chr1 17411 0
chr1 17412 0
chr1 17413 0
#skipping some lines to show another part of the dataframe
chr11 93466856 4
chr11 93466857 4
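One way this output could be produced is to repeat each row index once per position in its (B) interval, so that chrom and value stay aligned with the expanded positions. The following is a minimal base R sketch under some assumptions: the data frame is called df as in the description, Bend is treated as exclusive (matching seq(Bstart, Bend - 1) above), Bend is never smaller than Bstart, and only three rows of the example table are used as toy input.

```r
# Toy input: three rows copied from the example table
df <- data.frame(
  chrom  = c("chr1", "chr1", "chr11"),
  value  = c(0, 0, 4),
  Astart = c(0, 0, 93466844),
  Aend   = c(54519752, 54519752, 93466880),
  Bstart = c(17408, 17368, 93466856),
  Bend   = c(17431, 17391, 93466878),
  stringsAsFactors = FALSE
)

# Number of positions per row (Bend excluded, as in seq(Bstart, Bend - 1))
n <- df$Bend - df$Bstart

# Repeat each row index n times: one entry per output position
idx <- rep(seq_len(nrow(df)), times = n)

# sequence(n) yields 1..n[1], 1..n[2], ...; shifting by Bstart gives the positions
result <- data.frame(
  chrom    = df$chrom[idx],
  position = df$Bstart[idx] + sequence(n) - 1,
  value    = df$value[idx],
  stringsAsFactors = FALSE
)

head(result)
```

Because chrom, position and value are all indexed by the same idx vector, identical position integers on different chromosomes remain distinguishable. For a very large data frame the result has one row per base of every (B) interval, so memory use may need checking first.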