我有一个 Rcpp 函数,它读取大型 BAM 文件(1-20GB,使用htslib
)并创建几个非常长std::vector
的 s(最多 80M 个元素)。在阅读之前不知道元素的数量,所以我不能使用Rcpp::IntegerVector
and Rcpp::CharacterVector
。据我了解,当我Rcpp::wrap
进一步使用它们时,会创建副本。在这种情况下,有没有办法加快数据从 C++ 到 R 的传输?是否有可以在 Rcpp 函数中创建的数据结构,对push_back
元素尽可能快std::vector
,并通过引用传递给 R?
以防万一,这是我目前创建它们的方式:
std::vector<std::string> seq, xm;
std::vector<int> rname, strand, start;
以下是我包装和退回它们的方式:
Rcpp::IntegerVector w_rname = Rcpp::wrap(rname);
w_rname.attr("class") = "factor";
w_rname.attr("levels") = chromosomes; // chromosomes contain names of the reference sequences from BAM
Rcpp::IntegerVector w_strand = Rcpp::wrap(strand);
w_strand.attr("class") = "factor";
w_strand.attr("levels") = strands; // std::vector<std::string> strands = {"+", "-"};
Rcpp::DataFrame res = Rcpp::DataFrame::create(
Rcpp::Named("rname") = w_rname,
Rcpp::Named("strand") = w_strand,
Rcpp::Named("start") = start,
Rcpp::Named("seq") = seq,
Rcpp::Named("XM") = xm
);
return(res);
编辑 1(2021.10.19):
感谢大家的意见,我需要更多的时间来检查是否stringfish
可以使用,但我从 cpp11 包 vignettes 中运行了一个稍微修改的测试,以将其与std::vector
. 这是代码和结果(表明它std::vector<int>
仍然更快,尽管它必须Rcpp::wrap
在返回时被 ped):
Rcpp::cppFunction('
#include <Rcpp.h>
using namespace Rcpp;
//[[Rcpp::export]]
std::vector<int> stdint_grow_(SEXP n_sxp) {
R_xlen_t n = REAL(n_sxp)[0];
std::vector<int> x;
R_xlen_t i = 0;
while (i < n) {
x.push_back(i++);
}
return x;
}')
library(cpp11test)
grid <- expand.grid(len = 10 ^ (0:7), pkg = c("cpp11", "stdint"), stringsAsFactors = FALSE)
b_grow <- bench::press(.grid = grid,
{
fun = match.fun(sprintf("%sgrow_", ifelse(pkg == "cpp11", "", paste0(pkg, "_"))))
bench::mark(
fun(len)
)
}
)[c("len", "pkg", "min", "mem_alloc", "n_itr", "n_gc")]
print(b_grow, n=Inf)
# A tibble: 12 × 6
len pkg min mem_alloc n_itr n_gc
<dbl> <chr> <bch:tm> <bch:byt> <int> <dbl>
1 100 cpp11 1.9µs 1.89KB 9999 1
2 1000 cpp11 6.1µs 16.03KB 9999 1
3 10000 cpp11 58.11µs 256.22KB 7267 12
4 100000 cpp11 488.15µs 2MB 815 11
5 1000000 cpp11 4.34ms 16MB 88 14
6 10000000 cpp11 97.39ms 256MB 4 5
7 100 stdint 1.6µs 2.93KB 10000 0
8 1000 stdint 3.36µs 6.45KB 9998 2
9 10000 stdint 19.87µs 41.6KB 9998 2
10 100000 stdint 181.88µs 393.16KB 2571 4
11 1000000 stdint 1.91ms 3.82MB 213 3
12 10000000 stdint 36.09ms 38.15MB 9 1
编辑2:
std::vector<std::string>
比在这些测试条件下稍慢cpp11::writable::strings
,但更节省内存:
Rcpp::cppFunction('
#include <Rcpp.h>
using namespace Rcpp;
//[[Rcpp::export]]
std::vector<std::string> stdstr_grow_(SEXP n_sxp) {
R_xlen_t n = REAL(n_sxp)[0];
std::vector<std::string> x;
R_xlen_t i = 0;
while (i++ < n) {
std::string s (i, 33);
x.push_back(s);
}
return x;
}')
cpp11::cpp_source(code='
#include "cpp11/strings.hpp"
[[cpp11::register]] cpp11::writable::strings cpp11str_grow_(R_xlen_t n) {
cpp11::writable::strings x;
R_xlen_t i = 0;
while (i++ < n) {
std::string s (i, 33);
x.push_back(s);
}
return x;
}
')
library(cpp11test)
grid <- expand.grid(len = 10 ^ (0:5), pkg = c("cpp11str", "stdstr"), stringsAsFactors = FALSE)
b_grow <- bench::press(.grid = grid,
{
fun = match.fun(sprintf("%sgrow_", ifelse(pkg == "cpp11", "", paste0(pkg, "_"))))
bench::mark(
fun(len)
)
}
)[c("len", "pkg", "min", "mem_alloc", "n_itr", "n_gc")]
print(b_grow, n=Inf)
# A tibble: 12 × 6
len pkg min mem_alloc n_itr n_gc
<dbl> <chr> <bch:tm> <bch:byt> <int> <dbl>
1 1 cpp11str 1.22µs 0B 10000 0
2 10 cpp11str 3.02µs 0B 9999 1
3 100 cpp11str 22µs 1.89KB 9997 3
4 1000 cpp11str 765.28µs 541.62KB 602 2
5 10000 cpp11str 66.69ms 47.91MB 8 0
6 100000 cpp11str 6.83s 4.62GB 1 0
7 1 stdstr 1.38µs 2.49KB 10000 0
8 10 stdstr 1.86µs 2.49KB 10000 0
9 100 stdstr 16.44µs 3.32KB 10000 0
10 1000 stdstr 898.23µs 10.35KB 511 0
11 10000 stdstr 73.55ms 80.66KB 7 0
12 100000 stdstr 7.54s 783.79KB 1 0
解决方案(2022.01.12):
...对于那些有类似问题的人。在这种特殊情况下,我不需要std::vector
在 R 中使用数据。因此XPtr
很容易解决了我的问题,将 BAM 加载时间缩短了近两倍。创建指针:
std::vector<std::string>* seq = new std::vector<std::string>;
std::vector<std::string>* xm = new std::vector<std::string>;
然后存储为data.frame
属性:
Rcpp::DataFrame res = Rcpp::DataFrame::create(
Rcpp::Named("rname") = w_rname,
Rcpp::Named("strand") = w_strand,
Rcpp::Named("start") = start
);
Rcpp::XPtr<std::vector<std::string>> seq_xptr(seq, true);
res.attr("seq_xptr") = seq_xptr;
Rcpp::XPtr<std::vector<std::string>> xm_xptr(xm, true);
res.attr("xm_xptr") = xm_xptr;
并在其他地方重复使用,如下所示:
Rcpp::XPtr<std::vector<std::string>> seq((SEXP)df.attr("seq_xptr"));
Rcpp::XPtr<std::vector<std::string>> xm((SEXP)df.attr("xm_xptr"));