r - Block bootstrap for genomic data

Question

I am trying to implement a block bootstrap procedure, but I haven't figured out a way of doing this efficiently.

My data.frame has the following structure:

CHR POS var_A var_B
1 192 0.9 0.7
1 2000  0.8 0.3
2 3 0.21  0.76 
2 30009 0.36  0.15
...

The first column is the chromosome identifier, the second column is the position, and the last two columns are the variables whose correlation I want to calculate. The problem is that the rows are not entirely independent of one another: the dependence grows as the distance between them shrinks (the closer, the more dependent), so I cannot simply do cor(df$var_A, df$var_B).

The way out of this problem that is commonly used with this type of data is performing a block bootstrap. That is, I need to divide my data into blocks of length X, randomly select one row inside that block, and then calculate my statistic of interest. Note, however, that these blocks need to be defined based on the column POS, and not based on the row number. Also, this procedure needs to be done for each chromosome.

I tried to implement this, but I came up with the slowest code possible (it didn't even finish running) and I am not 100% sure it works.

x = 1000
cors = numeric()
iter = 1000
for(j in 1:iter) {
  df=freq[0,]
  for (i in unique(freq$CHR)) {
    t = freq[freq$CHR==i,]
    fim = t[nrow(t),2]
    i = t[1,2]
    f = i + x
    while(f < fim) {
      rows = which(t$POS>=i & t$POS<f)
      s = sample(rows)
      df = rbind(df,t[s,])
      i = f
      f = f + x
    }
  }
  cors = c(cors, cor(df$var_A, df$var_B))
}

Could anybody help me out? I am sure there is a more efficient way of doing this.

Thank you in advance.

score 1 · Accepted Answer

I hope I understood you correctly:

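# needed for round_any()
library(plyr)

res <- lapply(unique(freq$CHR), function(chr) {

  # rows belonging to this chromosome
  freq_sel <- freq[freq$CHR == chr, ]
  # start positions of the 1 kb blocks: 1, 1001, 2001, ...
  starts <- seq(1, round_any(max(freq_sel$POS), 1000, ceiling), 1000)
  # rows in each block; >= keeps positions that sit exactly on a block start
  blocks <- lapply(starts, function(ix) freq_sel[freq_sel$POS >= ix & freq_sel$POS <= ix + 999, ])
  # one random row per block (blocks with 0 or 1 rows are returned unchanged)
  do.call(rbind, lapply(blocks, function(b) if (nrow(b) > 1) b[sample(1:nrow(b), 1), ] else b))

})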

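This should return a list with one entry per chromosome. Within each entry there is one observation per 1 kb block, for every block that contains any rows. The number of blocks is determined by the maximum value of POS.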
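The same one-row-per-block selection can also be sketched in base R without building each block's data frame separately. This is only a minimal sketch under stated assumptions: the `freq` data frame below is made-up example data with the question's column names, and `block_len`, `rows_by_block`, `pick_one` and `thinned` are hypothetical names introduced for illustration.

```r
# Hypothetical example data with the same columns as the question's `freq`.
set.seed(1)
freq <- data.frame(
  CHR   = rep(1:2, each = 50),
  POS   = c(sort(sample(1:20000, 50)), sort(sample(1:20000, 50))),
  var_A = runif(100),
  var_B = runif(100)
)

block_len <- 1000  # block length in POS units (assumed, as in the answer)

# Block index from POS: positions 1..1000 -> 1, 1001..2000 -> 2, ...
block_id <- ceiling(freq$POS / block_len)

# Row numbers grouped by (chromosome, block); empty combinations are dropped.
rows_by_block <- split(seq_len(nrow(freq)),
                       interaction(freq$CHR, block_id, drop = TRUE))

# Draw one row number per block. Indexing through sample.int() avoids the
# sample(n) pitfall when a block holds a single row.
pick_one <- function(idx) idx[sample.int(length(idx), 1)]
picked <- vapply(rows_by_block, pick_one, integer(1))

# One row per (CHR, block), then the statistic of interest.
thinned <- freq[picked, ]
cor(thinned$var_A, thinned$var_B)
```

Because only row numbers are shuffled around, nothing is grown with rbind() inside a loop, which is usually where the time goes in the original attempt.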

Edit:

library(doParallel)
library(foreach)
library(plyr)

cl <- makeCluster(detectCores())
registerDoParallel(cl)

# one task per chromosome; plyr is loaded on each worker for round_any()
res <- foreach(x = unique(freq$CHR), .packages = 'plyr') %dopar% {

  freq_sel <- freq[freq$CHR == x, ]
  # >= keeps positions that sit exactly on a block start (1, 1001, ...)
  blocks <- lapply(seq(1, round_any(max(freq_sel$POS), 1000, ceiling), 1000), function(ix) freq_sel[freq_sel$POS >= ix & freq_sel$POS <= ix + 999, ])
  do.call(rbind, lapply(blocks, function(b) if (nrow(b) > 1) b[sample(1:nrow(b), 1), ] else b))

}

stopCluster(cl)

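This is a simple parallelization over chromosomes using foreach. It would probably be better to restructure the function and parallelize at another level (e.g. the 1000 iterations, or possibly the blocks). In any case, I can only repeat what I said in my comment: before you parallelize your code, you should make sure it is as efficient as possible. That means you may want to look at the boot package or something similar to improve efficiency. That said, with the number of iterations you are planning, parallel processing may well be useful once you are happy with your function.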
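As a rough illustration of parallelizing over the iterations instead of the chromosomes, here is a minimal sketch. It uses the base parallel package rather than foreach, which is a swapped-in choice and not something this answer prescribes. The example `freq` data, the helper `one_replicate`, the 2 workers and the 200 iterations are all assumptions made for illustration.

```r
library(parallel)

# Hypothetical example data with the same columns as the question's `freq`.
set.seed(1)
freq <- data.frame(
  CHR   = rep(1:2, each = 50),
  POS   = c(sort(sample(1:20000, 50)), sort(sample(1:20000, 50))),
  var_A = runif(100),
  var_B = runif(100)
)

# One replicate: one random row per (CHR, block), then the correlation.
# It depends only on its arguments, so it can be shipped to the workers.
one_replicate <- function(iter, freq, block_len) {
  block_id <- ceiling(freq$POS / block_len)
  groups <- split(seq_len(nrow(freq)),
                  interaction(freq$CHR, block_id, drop = TRUE))
  picked <- vapply(groups, function(idx) idx[sample.int(length(idx), 1)],
                   integer(1))
  cor(freq$var_A[picked], freq$var_B[picked])
}

cl <- makeCluster(2)          # assumption: 2 workers
clusterSetRNGStream(cl, 123)  # independent, reproducible random streams
cors <- parSapply(cl, 1:200, one_replicate, freq = freq, block_len = 1000)
stopCluster(cl)

# Spread of the statistic across the replicates.
quantile(cors, c(0.025, 0.5, 0.975))
```

Each task here is a whole replicate, so the number of tasks scales with the iteration count rather than with the (usually small) number of chromosomes.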

score 1 · Accepted Answer

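One efficient approach to try is the boot package, whose functions include support for parallel processing.

In particular, tsboot, the time series bootstrap function, selects ordered blocks of data. This could work if your POS variable is some kind of ordered observation.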

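The functions in the boot package are great, but they need a little help first. To use them, you first have to wrap the statistic of interest in a function. For boot(), that function must take an index argument: this is the device by which the indices generated by the bootstrap are used to pass the resampled data to your statistic. tsboot(), by contrast, passes the resampled series itself to the statistic, so index is given a default here and the same function works with both.

cor_hat <- function(data, index = seq_len(nrow(data))) cor(y = data[index, "var_A"], x = data[index, "var_B"])

Note the arguments that follow cor_hat below. The arguments sim = "fixed", l = 1000 indicate that you want fixed-length blocks of length (l) 1000. Here l counts consecutive rows, not units of POS. You can make the blocks any size, 5 or 10, if you are trying to capture nearest-neighbour dynamics moving through time. The "multicore" argument is self-explanatory, but if you are using Windows it may need to be "snow".

library(boot)
# tsboot() expects a numeric (multivariate) series, so pass only the two variables as a matrix
tsboot(as.matrix(data[, c("var_A", "var_B")]), cor_hat, R = 1000, sim = "fixed", l = 1000, parallel = "multicore", ncpus = 4)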

In addition, page 194 of Elements of Statistical Learning gives a good example of the framework using the traditional boot function, all of which carries over to tsboot.
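To make the tsboot call concrete, here is a small self-contained sketch on simulated data. Everything in it is assumed for illustration: the autocorrelated series stands in for var_A and var_B on a single chromosome with rows already ordered by POS, and the names `dat`, `cor_stat` and `bt`, the block length of 20 rows and R = 200 are arbitrary choices rather than recommendations.

```r
library(boot)

set.seed(42)
n <- 500

# Hypothetical autocorrelated signal shared by both variables.
shared <- as.numeric(arima.sim(list(ar = 0.7), n = n))
dat <- cbind(var_A = shared + rnorm(n), var_B = shared + rnorm(n))

# tsboot() hands the statistic a resampled series, so no index is needed.
cor_stat <- function(series) cor(series[, 1], series[, 2])

# Fixed-length blocks of 20 consecutive rows, 200 bootstrap replicates.
bt <- tsboot(dat, cor_stat, R = 200, sim = "fixed", l = 20)

bt$t0          # correlation on the original data
sd(bt$t[, 1])  # block-bootstrap standard error of the correlation
```

With several chromosomes, one option would be to run this per chromosome, since blocks of consecutive rows should not straddle a chromosome boundary.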

Hope that helps, and good luck.

Justin


score 0 · Accepted Answer

So, after a while, I came up with an answer to my own question. Here it goes.

You need the dplyr package.

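library(dplyr)

l = 1000  # window length in POS units
teste = freq %>%
  mutate(w = ceiling(POS/l)) %>%  # window index for each row
  group_by(CHR, w) %>%
  sample_n(1)                     # one random row per CHR/window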

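This code creates a new variable named w based on the genomic position (POS). The variable w is the window each row is assigned to, and it depends on l, the window length.

You can repeat this code many times, each time sampling one row per window/CHR (using sample_n(1)) and applying whatever statistic of interest you want.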
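A minimal sketch of that repetition might look as follows. The example `freq` data, the 200 repetitions and the names `freq_w` and `cors` are assumptions made for illustration; the windows are assigned once and only the sampling step is repeated.

```r
library(dplyr)

# Hypothetical example data with the same columns as the question's `freq`.
set.seed(1)
freq <- data.frame(
  CHR   = rep(1:2, each = 50),
  POS   = c(sort(sample(1:20000, 50)), sort(sample(1:20000, 50))),
  var_A = runif(100),
  var_B = runif(100)
)

l <- 1000  # window length in POS units

# Assign windows once; the grouping is reused in every repetition.
freq_w <- freq %>%
  mutate(w = ceiling(POS / l)) %>%
  group_by(CHR, w)

# Each repetition draws one row per CHR/window and computes the correlation.
cors <- replicate(200, {
  s <- sample_n(freq_w, 1)
  cor(s$var_A, s$var_B)
})

# Spread of the correlation across the repetitions.
quantile(cors, c(0.025, 0.975))
```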
