
I have a function (EstimateUniques) whose loop is parallelized with OpenMP. I expected multithreading to be more efficient than multiprocessing, but when I compare this function against a naive run with `mclapply`, it shows worse performance. What is the correct way to achieve the same level of parallelism in C++ as in R? Am I doing something wrong?

Performance comparison (time in seconds):

#Cores    CPP     R
   1    1.721s  1.538s
   2    1.945s  1.080s
   3    2.858s  0.801s

The code:

Rcpp::sourceCpp('ReproducibleExample.cpp')

arr <- 1:10000
n_rep <- 150
n_iters <- 200

EstimateUniquesR <- function(arr, n_iters, n_rep, cores) {
  parallel::mclapply(1:n_iters, function(i) 
    GetNumberOfUniqSamples(arr, i * 10, n_rep), mc.cores=cores)
}

cpp_times <- sapply(1:3, function(threads) 
  system.time(EstimateUniques(arr, n_iters, n_rep, threads))['elapsed'])
r_times <- sapply(1:3, function(cores) 
  system.time(EstimateUniquesR(arr, n_iters, n_rep, cores))['elapsed'])

data.frame(CPP=cpp_times, R=r_times)

The ReproducibleExample.cpp file:

// [[Rcpp::plugins(openmp)]]
// [[Rcpp::plugins(cpp11)]]

#include <Rcpp.h>

#include <algorithm>
#include <cmath>    // std::round
#include <cstdlib>  // rand
#include <vector>
#include <omp.h>

// [[Rcpp::export]]
int GetNumberOfUniqSamples(const std::vector<int> &bs_array, int size, unsigned n_rep) {
  unsigned long sum = 0;
  for (unsigned i = 0; i < n_rep; ++i) {
    std::vector<int> uniq_vals(size);
    for (int try_num = 0; try_num < size; ++try_num) {
      uniq_vals[try_num] = bs_array[rand() % bs_array.size()];
    }
    std::sort(uniq_vals.begin(), uniq_vals.end());
    sum += std::distance(uniq_vals.begin(), std::unique(uniq_vals.begin(), uniq_vals.end()));
  }

  return std::round(double(sum) / n_rep);
}

// [[Rcpp::export]]
std::vector<int> EstimateUniques(const std::vector<int> &bs_array, const int n_iters, 
                                 const int n_rep = 1000, const int threads=1) {
  std::vector<int> uniq_counts(n_iters);

#pragma omp parallel for num_threads(threads) schedule(dynamic)
  for (int i = 0; i < n_iters; ++i) {
    uniq_counts[i] = GetNumberOfUniqSamples(bs_array, (i + 1) * 10, n_rep);
  }

  return uniq_counts;
}

I tried other scheduling types in OpenMP, but the results were even worse.
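One thing worth ruling out: `rand()` is not required to be thread-safe, and in common implementations (e.g. glibc) all threads share its state behind an internal lock, which can serialize exactly the hot sampling loop above. A minimal standalone sketch of the same sampling with one `std::mt19937` per thread (the names `CountUniqSamples` and `EstimateUniquesLocalRng` are hypothetical, and the Rcpp attributes are dropped so it compiles on its own):

```cpp
#include <algorithm>
#include <random>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Same bootstrap count as GetNumberOfUniqSamples, but the caller passes in a
// generator, so each thread can own one and no hidden lock is shared.
int CountUniqSamples(const std::vector<int>& bs_array, int size,
                     unsigned n_rep, std::mt19937& gen) {
  std::uniform_int_distribution<std::size_t> pick(0, bs_array.size() - 1);
  unsigned long sum = 0;
  for (unsigned i = 0; i < n_rep; ++i) {
    std::vector<int> uniq_vals(size);
    for (int try_num = 0; try_num < size; ++try_num)
      uniq_vals[try_num] = bs_array[pick(gen)];
    std::sort(uniq_vals.begin(), uniq_vals.end());
    sum += std::unique(uniq_vals.begin(), uniq_vals.end()) - uniq_vals.begin();
  }
  return static_cast<int>((sum + n_rep / 2) / n_rep);  // rounded mean
}

std::vector<int> EstimateUniquesLocalRng(const std::vector<int>& bs_array,
                                         int n_iters, int n_rep, int threads) {
  std::vector<int> uniq_counts(n_iters);
#pragma omp parallel num_threads(threads)
  {
    // One generator per thread; offset the seed so streams differ.
#ifdef _OPENMP
    std::mt19937 gen(42 + omp_get_thread_num());
#else
    std::mt19937 gen(42);
#endif
#pragma omp for schedule(dynamic)
    for (int i = 0; i < n_iters; ++i)
      uniq_counts[i] = CountUniqSamples(bs_array, (i + 1) * 10, n_rep, gen);
  }
  return uniq_counts;
}
```

In the `mclapply` version this contention never arises, because each fork gets its own copy of the RNG state, which could by itself explain the inverted scaling.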
