c++ - Huge difference in implementation?

Question

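I am writing some functions for distributions and used the normal distribution to run tests between my implementation and C++ Boost.

Given the probability density function (pdf: http://www.mathworks.com/help/stats/normpdf.html)

which I wrote like this: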

double NormalDistribution1D::prob(double x) {
    return (1 / (sigma * (std::sqrt(boost::math::constants::pi<double>()*2))))*std::exp((-1 / 2)*(((x - mu) / sigma)*((x - mu) / sigma)));
}

To compare my results with how it is done using C++ Boost:

    boost::math::normal_distribution <> d(mu, sigma);
    return boost::math::pdf(d, x);

I was not very surprised: my version took 44278 nanoseconds, Boost only 326.

So I played around a bit, wrote a method probboost in my NormalDistribution1D class, and compared all three:

void MATTest::runNormalDistribution1DTest1() {
    double mu = 0;
    double sigma = 1;
    double x = 0;

    std::chrono::high_resolution_clock::time_point tn_start = std::chrono::high_resolution_clock::now();
    NormalDistribution1D *n = new NormalDistribution1D(mu, sigma);
    double nres = n->prob(x);
    std::chrono::high_resolution_clock::time_point tn_end = std::chrono::high_resolution_clock::now();

    std::chrono::high_resolution_clock::time_point tdn_start = std::chrono::high_resolution_clock::now();
    NormalDistribution1D *n1 = new NormalDistribution1D(mu, sigma);
    double nres1 = n1->probboost(x);
    std::chrono::high_resolution_clock::time_point tdn_end = std::chrono::high_resolution_clock::now();

    std::chrono::high_resolution_clock::time_point td_start = std::chrono::high_resolution_clock::now();
    boost::math::normal_distribution <> d(mu, sigma);
    double dres = boost::math::pdf(d, x);
    std::chrono::high_resolution_clock::time_point td_end = std::chrono::high_resolution_clock::now();

    std::cout << "Mu : " << mu << "; Sigma: " << sigma << "; x" << x << std::endl;
    if (nres == dres) {
        std::cout << "Result" << nres << std::endl;
    } else {
        std::cout << "\033[1;31mRes incorrect: " << nres << "; Correct: " << dres << "\033[0m" << std::endl;
    }


    auto duration_n = std::chrono::duration_cast<std::chrono::nanoseconds>(tn_end - tn_start).count();
    auto duration_d = std::chrono::duration_cast<std::chrono::nanoseconds>(td_end - td_start).count();
    auto duration_dn = std::chrono::duration_cast<std::chrono::nanoseconds>(tdn_end - tdn_start).count();

    std::cout << "own boost: " << duration_dn << std::endl;
    if (duration_n < duration_d) {
        std::cout << "Boost: " << (duration_d) << "; own implementation: " << duration_n << std::endl;
    } else {
        std::cout << "\033[1;31mBoost faster: " << (duration_d) << "; than own implementation: " << duration_n << "\033[0m" << std::endl;
    }
}

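The results are (compiling and running the check method 3 times):

own boost: 1082 Boost faster: 326; than own implementation: 44278

own boost: 774 Boost faster: 216; than own implementation: 34291

own boost: 769 Boost faster: 230; than own implementation: 33456

Now this confuses me: how can a method in my class take 3 times longer than the same statements called directly?

My compile options: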

g++ -O2   -c -g -std=c++11 -MMD -MP -MF "build/Debug/GNU-Linux-x86/main.o.d" -o build/Debug/GNU-Linux-x86/main.o main.cpp

g++ -O2    -o ***Classes***

score 3 · Accepted Answer

First of all, you are allocating your object dynamically, with new:

NormalDistribution1D *n = new NormalDistribution1D(mu, sigma);
double nres = n->prob(x);

If you did it the way you did with Boost, that alone would be enough to get the same (or comparable) speed:

NormalDistribution1D n(mu, sigma);
double nres = n.prob(x);
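To keep allocation and construction out of the measurement entirely, the timed region can be narrowed to the call itself. The following is only a minimal sketch: `Timed` and `time_call` are hypothetical names introduced here for illustration, not part of the question's code or of Boost.

```cpp
#include <chrono>
#include <utility>

// Result of one timed call: the returned value and the elapsed time.
struct Timed {
    double value;
    std::chrono::nanoseconds elapsed;
};

// Hypothetical helper: times only the call f() and nothing else.
// Any object that f uses should be constructed before calling this,
// so that neither new nor a constructor falls inside the timed region.
template <typename F>
Timed time_call(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    const double value = std::forward<F>(f)();
    const auto end = std::chrono::steady_clock::now();
    return Timed{value,
                 std::chrono::duration_cast<std::chrono::nanoseconds>(end - start)};
}
```

With the class from the question, usage might look like `NormalDistribution1D n(mu, sigma); Timed t = time_call([&] { return n.prob(x); });`, with the Boost call wrapped in a lambda the same way so both measurements cover comparable work.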

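Now, I don't know whether the way you spelled out the expression in NormalDistribution1D::prob() matters, but I doubt that writing it in a more "optimized" way would make any difference, because arithmetic expressions like this are the kind of thing compilers can optimize very well. It might get faster if you use the -ffast-math switch, which gives the compiler more freedom to optimize.

Also, if the definition of double NormalDistribution1D::prob(double x) is in another compilation unit (another .cpp file), the compiler will not be able to inline it, and that will add noticeable overhead (maybe twice as slow, or less). In Boost, almost everything is implemented in headers, so inlining will always happen when the compiler sees fit. You can overcome this problem if you compile and link with gcc's -flto switch.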
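As an illustration of the header-only approach, a member function defined inside the class body is implicitly inline, so its body is visible to every translation unit that includes the header. The sketch below assumes a simplified version of the asker's class; the member names `mu_` and `sigma_`, the include guard name, and the literal for pi are assumptions, since the original class definition is not shown.

```cpp
// Hypothetical header, e.g. NormalDistribution1D.h
#ifndef NORMAL_DISTRIBUTION_1D_H
#define NORMAL_DISTRIBUTION_1D_H

#include <cmath>

class NormalDistribution1D {
public:
    NormalDistribution1D(double mu, double sigma) : mu_(mu), sigma_(sigma) {}

    // Defined in the class body: implicitly inline, so the compiler can
    // inline the call at each use site without needing -flto.
    double prob(double x) const {
        const double pi = 3.14159265358979323846;
        const double z = (x - mu_) / sigma_;
        return std::exp(-0.5 * z * z) / (sigma_ * std::sqrt(2.0 * pi));
    }

private:
    double mu_;
    double sigma_;
};

#endif // NORMAL_DISTRIBUTION_1D_H
```

Whether the compiler actually inlines the call is still its own decision; the header definition only makes inlining possible.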

score 2 · Accepted Answer

You did not compile with the -ffast-math option. That means the compiler cannot (in fact, must not!) simplify (-1 / 2)*(((x - mu) / sigma)*((x - mu) / sigma)) to a form similar to the one used in boost::math::pdf,

double expo = (x - mu) / sigma;   // standardized distance from the mean
expo *= -expo;                    // -((x - mu) / sigma)^2
expo /= 2;
double result = std::exp(expo);
result /= sigma * std::sqrt(2 * boost::math::constants::pi<double>());

The above forces the compiler to do the calculation the fast way without using -ffast-math.
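A side note on the quoted expression, separate from the speed question: in C++, (-1 / 2) divides two int literals, so it evaluates to 0 rather than -0.5, and the whole exponent becomes zero for any finite input. With x = 0 and mu = 0, as in the question's test, both forms give a zero exponent, so the difference would not show up there. A small sketch (the function names are made up for this example):

```cpp
// Exponent exactly as written in the question: (-1 / 2) is integer
// division and yields 0, so the product is 0 for any finite input.
double exponent_as_written(double x, double mu, double sigma) {
    return (-1 / 2) * (((x - mu) / sigma) * ((x - mu) / sigma));
}

// Exponent with a floating-point factor, which gives -z^2 / 2.
double exponent_floating(double x, double mu, double sigma) {
    const double z = (x - mu) / sigma;
    return -0.5 * z * z;
}
```

Writing the factor as -0.5 or (-1.0 / 2) avoids the problem.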

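Secondly, the difference in timing between the above code and yours is small compared to the time needed to allocate from the heap (new) versus the stack (a local variable). You are timing the cost of allocating dynamic memory.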
