r - 基于多列隔离数据区域

Question

我有一个数据集，在这里缩写：

SNP chr       BP log10   PPA
rs10068  17 56555 1.16303 0.030
rs10032  17 56561 26.364 0.975
rs10354  17 34951 4.3212 0.626
rs10043  17 20491 0.00097 0.006
rs10457  17 69572 -0.38403 0.014
rs10465  17 69872 8.19547 0.927

其中 PPA 是关联的后验概率。由于我有一些高 log10 值 (>6)，我想确定这些区域周围的可信区间，以确定它们的大小。

为此，我首先想识别 log10 > 6 的 SNP，这很简单，使用子集。

newdata <- subset(data, log10 > 6)

但是，我还想在这个子集中包括物理上靠近这些先导 SNP 的 SNP，使用 BP 500 +/- 先导 SNP 的 BP（log10>6）。在这里，我不确定最好的方法。这是我可以研究的，subset还是我应该首先在我的原始数据中识别这些主要 SNP，然后从那里子集？

一旦我隔离了这些区域，我就可以继续前进。

任何建议表示赞赏！

score 6 · Accepted Answer

s <- read.table(header=T, text="SNP chr       BP log10   PPA
rs10068  17 56555 1.16303 0.030
rs10032  17 56561 26.364 0.975
rs10354  17 34951 4.3212 0.626
rs10043  17 20491 0.00097 0.006
rs10457  17 69572 -0.38403 0.014
rs10465  17 69872 8.19547 0.927")

到 s$log10 > 6 的任何行的距离：

outer(s$BP[s$log10 > 6], s$BP, '-')
##       [,1]  [,2]  [,3]  [,4]   [,5]   [,6]
## [1,]     6     0 21610 36070 -13011 -13311
## [2,] 13317 13311 34921 49381    300      0

上面任何绝对值 < 500 的列都是您要保留的列：

s[apply(outer(s$BP[s$log10 > 6], s$BP, '-'), 2, function(x) any(abs(x) < 500)),]
##       SNP chr    BP    log10   PPA
## 1 rs10068  17 56555  1.16303 0.030
## 2 rs10032  17 56561 26.36400 0.975
## 5 rs10457  17 69572 -0.38403 0.014
## 6 rs10465  17 69872  8.19547 0.927

score 1 · Accepted Answer

对于它的价值，这是一个使用GenomicRanges包的解决方案。这个包建立在IRanges并使用非常有效地处理区间数据interval trees。这也应该能够处理所有不同的染色体。

首先加载数据：

# load data
df <- read.table(header=T, text="SNP chr       BP log10   PPA
rs10068  17 56555 1.16303 0.030
rs10032  17 56561 26.364 0.975
rs10354  17 34951 4.3212 0.626
rs10043  17 20491 0.00097 0.006
rs10457  17 69572 -0.38403 0.014
rs10465  17 69872 8.19547 0.927")

获取 log10 > 6 的子集
```
df.f <- df[df$log10 > 6, ]
```

在初始数据和子集上加载和创建范围

require(GenomicRanges)
df.gr <- GRanges(Rle(df$chr), IRanges(df$BP, df$BP))
df.f.gr <- GRanges(Rle(df.f$chr), IRanges(df.f$BP, df.f$BP))

查找 log10 > 6 的所有数据与所有其他 SNP 的重叠，间隙 = +/- 500
```
olaps <- findOverlaps(df.f.gr, df.gr, maxgap=500)
```

获取输出

# if you just want a whole data.frame with ALL SNPs that have 
# log > 6 or within 500 bases of one with log > 6, then,        
out <- df[subjectHits(olaps), ] 

#       SNP chr    BP    log10   PPA
# 1 rs10068  17 56555  1.16303 0.030
# 2 rs10032  17 56561 26.36400 0.975
# 5 rs10457  17 69572 -0.38403 0.014
# 6 rs10465  17 69872  8.19547 0.927

# In case you want for each SNP that has log > 6, all of the 
# snps that are within 500 bases (either side) apart for 
# this SNP, separately, then,
out.list <- split(out, df$BP[queryHits(olaps)])
# $`56555`
#       SNP chr    BP    log10   PPA
# 1 rs10068  17 56555  1.16303 0.030
# 2 rs10032  17 56561 26.36400 0.975
#     
# $`56561`
#       SNP chr    BP    log10   PPA
# 5 rs10457  17 69572 -0.38403 0.014
# 6 rs10465  17 69872  8.19547 0.927

r - 基于多列隔离数据区域

2 回答 2

Related

Reference