r - How to populate a bigstatsr::FBM from a SQLite database for later use?

Question

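I'm new to the bigstatsr package. I have a SQLite database that I want to convert into an FBM matrix with 40k rows (genes) and 60k columns (samples) for later use. I found examples of how to fill a matrix with random values, but I'm not sure what the best way is to fill it with values from my SQLite database.

Currently I do it sequentially. Here is some mock code: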

library(bigstatsr)
library(RSQLite)
library(dplyr)

number_genes <- 50e3
number_samples <- 70e3

large_genomic_matrix <- bigstatsr::FBM(nrow = number_genes, 
                                       ncol = number_samples, 
                                       type = "double", 
                                       backingfile = "fbm_large_genomic_matrix")

# Fetch one sample's data frame at a time
database_connection <- dbConnect(RSQLite::SQLite(), "database.sqlite")

# vector_with_sample_names: character vector of sample names, defined elsewhere
sample_index_counter <- 1

for(current_sample in vector_with_sample_names){
  
  # tbl() takes the connection itself, not the output of dbListTables()
  sqlite_df <- dplyr::tbl(database_connection, "genomic_data") %>%
    dplyr::filter(sample == !!current_sample) %>% 
    dplyr::collect()
  
  large_genomic_matrix[, sample_index_counter] <- sqlite_df$value
  sample_index_counter <- sample_index_counter + 1
  
}

big_write(large_genomic_matrix, "large_genomic_matrix.out", every_nrow = 1000, progress = interactive())

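I have two questions:

Is there a way to fill the matrix more efficiently? I'm not sure whether big_apply can be used here, or perhaps foreach.
Do I always have to use big_write in order to load my matrix later? If so, why can't I just use the bk file?

Thanks in advance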

score 1 · Accepted Answer

This is a very good first attempt on your own.

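What is inefficient here is that dplyr::filter(sample == current_sample) is evaluated for every sample. I would first try using match() to get the indices. Filling each column individually is also somewhat inefficient; as you said, you can do this in blocks with big_apply().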
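As a rough illustration of the match() and big_apply() idea, here is a minimal sketch on a toy in-memory database. It assumes a long-format table genomic_data with columns gene, sample and value (the gene column and the toy data are assumptions, not taken from the question), and it runs on a single core so that the database connection can be shared with the worker function.

```r
library(bigstatsr)
library(DBI)

# Toy stand-in for the real database (hypothetical schema: gene, sample, value)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
genes   <- paste0("gene", 1:5)
samples <- paste0("sample", 1:4)
long <- expand.grid(gene = genes, sample = samples, stringsAsFactors = FALSE)
long$value <- as.numeric(seq_len(nrow(long)))
dbWriteTable(con, "genomic_data", long)

X <- FBM(nrow = length(genes), ncol = length(samples), type = "double")

# Fill one block of columns (samples) per call instead of one column at a time
fill_block <- function(X, ind, con, genes, samples) {
  placeholders <- paste(rep("?", length(ind)), collapse = ", ")
  query <- sprintf(
    "SELECT gene, sample, value FROM genomic_data WHERE sample IN (%s)",
    placeholders
  )
  df <- dbGetQuery(con, query, params = as.list(samples[ind]))

  # match() turns names into row/column positions within this block
  i <- match(df$gene, genes)
  j <- match(df$sample, samples[ind])

  block <- matrix(NA_real_, nrow = nrow(X), ncol = length(ind))
  block[cbind(i, j)] <- df$value
  X[, ind] <- block
  NULL
}

big_apply(X, a.FUN = fill_block, a.combine = "c",
          ncores = 1, block.size = 2,
          con = con, genes = genes, samples = samples)

dbDisconnect(con)
```

With the real data, block.size would be chosen so that one block of samples fits comfortably in memory; the value 2 here only exists to force several blocks on the toy matrix.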
big_write() is for writing an FBM to a text file (e.g. csv). What you want here is FBM()$save() (second line of the example in the README), then big_attach() on the .rds file (next line of the README).
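A minimal sketch of that save and re-attach workflow, assuming a temporary backing file (the name fbm_demo and the small dimensions are placeholders): the values live in the .bk file, while the .rds file written by $save() holds the object that big_attach() needs in order to find and interpret it.

```r
library(bigstatsr)

# Backing file path without extension; FBM() creates <base>.bk
base <- tempfile("fbm_demo")

# $save() writes <base>.rds next to the .bk file
X <- FBM(10, 5, type = "double", init = 0, backingfile = base)$save()

# Values written after saving still go to the .bk file on disk
X[, 1] <- 1:10

# Later, possibly in a new R session: re-attach from the .rds file
X2 <- big_attach(paste0(base, ".rds"))
X2[, 1]
```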
