c++ - std::function vs template

Question

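Thanks to C++11 we received the std::function family of functor wrappers. Unfortunately, I keep hearing only bad things about these new additions. The most common complaint is that they are horribly slow. I tested this, and they really do perform poorly compared with templates.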

#include <iostream>
#include <functional>
#include <string>
#include <chrono>

template <typename F>
float calc1(F f) { return -1.0f * f(3.3f) + 666.0f; }

float calc2(std::function<float(float)> f) { return -1.0f * f(3.3f) + 666.0f; }

int main() {
    using namespace std::chrono;

    const auto tp1 = high_resolution_clock::now();  // same clock as tp2, so the subtraction is well-formed everywhere
    for (int i = 0; i < 1e8; ++i) {
        calc1([](float arg){ return arg * 0.5f; });
    }
    const auto tp2 = high_resolution_clock::now();

    const auto d = duration_cast<milliseconds>(tp2 - tp1);  
    std::cout << d.count() << std::endl;
    return 0;
}

111 ms vs 1241 ms. I assume this is because templates can be nicely inlined, while std::function hides its internals behind virtual calls.

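As I see it, templates have obvious issues of their own:

- they have to be provided as headers, which is something you might not wish to do when releasing your library as closed code,
- they may make compilation times much longer unless an extern template-like policy is introduced,
- there is no clean way (at least none known to me) of expressing the requirements of a template (concepts, anyone?), bar a comment describing what kind of functor is expected.

Can I therefore assume that std::function can be used as the de facto standard for passing functors, and that templates should be used in places where high performance is expected?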

Edit:

My compiler is Visual Studio 2012 without the CTP.

score 182 · Accepted Answer

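In general, if you are facing a design situation that gives you a choice, use templates. I stress the word design because I think what you need to focus on is the distinction between the use cases of std::function and those of templates, which are quite different.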

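In general, choosing templates is just an instance of a wider principle: try to specify as many constraints as possible at compile time. The rationale is simple: if you can catch an error, or a type mismatch, before your program is even generated, you won't ship a buggy program to your customer.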

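Moreover, as you correctly pointed out, calls to template functions are resolved statically (i.e. at compile time), so the compiler has all the information it needs to optimize and possibly inline the code (which would not be possible if the call went through a vtable).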

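Yes, it is true that template support is not perfect, and C++11 still lacks support for concepts; however, I don't see how std::function would save you in that respect. std::function is not an alternative to templates, but rather a tool for design situations where templates cannot be used.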

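One such use case arises when you need to resolve a call at run time by invoking a callable object that adheres to a specific signature, but whose concrete type is unknown at compile time. This is typically the case when you have a collection of callbacks of potentially different types which you need to invoke uniformly; the type and number of the registered callbacks are determined at run time, based on the state of your program and the application logic. Some of those callbacks could be functors, some could be plain functions, and some could be the result of binding other functions to certain arguments.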

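std::function and std::bind also offer a natural idiom for enabling functional programming in C++, where functions are treated as objects and are naturally curried and combined to generate other functions. Although this kind of combination can be achieved with templates as well, such design situations normally come together with use cases that require the type of the combined callable objects to be determined at run time.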

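Finally, there are other situations where std::function is unavoidable, e.g. if you want to write recursive lambdas; however, I believe these restrictions are dictated more by technical limitations than by conceptual distinctions.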

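To sum up, focus on design and try to understand what the conceptual use cases for these two constructs are. If you compare them the way you did, you are forcing them into an arena where they likely don't belong.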

score 92 · Accepted Answer

Andy Prowl has nicely covered design issues. This is, of course, very important, but I believe the original question concerns more performance issues related to std::function.

First of all, a quick remark on the measurement technique: the 111ms obtained for calc1 has no meaning at all. Indeed, looking at the generated assembly (or debugging the assembly code), one can see that VS2012's optimizer is clever enough to realize that the result of calling calc1 is independent of the iteration and moves the call out of the loop:

for (int i = 0; i < 1e8; ++i) {
}
calc1([](float arg){ return arg * 0.5f; });

Furthermore, it realises that calling calc1 has no visible effect and drops the call altogether. Therefore, the 111ms is the time that the empty loop takes to run. (I'm surprised that the optimizer has kept the loop.) So, be careful with time measurements in loops. This is not as simple as it might seem.

As has been pointed out, the optimizer has more trouble understanding std::function and doesn't move the call out of the loop. So 1241ms is a fair measurement for calc2.

Notice that std::function is able to store different types of callable objects. Hence, it must perform some type-erasure magic for the storage. Generally, this implies a dynamic memory allocation (by default through a call to new). It's well known that this is quite a costly operation.

The standard (20.8.11.2.1/5) encourages implementations to avoid the dynamic memory allocation for small objects which, thankfully, VS2012 does (in particular, for the original code).

To get an idea of how much slower it can get when memory allocation is involved, I've changed the lambda expression to capture three floats. This makes the callable object too big to apply the small object optimization:

float a, b, c; // never mind the values
// ...
calc2([a,b,c](float arg){ return arg * 0.5f; });

For this version, the time is approximately 16000ms (compared to 1241ms for the original code).

Finally, notice that the lifetime of the lambda encloses that of the std::function. In this case, rather than storing a copy of the lambda, std::function could store a "reference" to it. By "reference" I mean a std::reference_wrapper, which is easily built by the functions std::ref and std::cref. More precisely, by using:

auto func = [a,b,c](float arg){ return arg * 0.5f; };
calc2(std::cref(func));

the time decreases to approximately 1860ms.

I wrote about that a while ago:

http://www.drdobbs.com/cpp/efficient-use-of-lambda-expressions-and/232500059

As I said in the article, the arguments don't quite apply to VS2010 due to its poor support for C++11. At the time of writing, only a beta version of VS2012 was available, but its support for C++11 was already good enough for this matter.

score 38 · Accepted Answer

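With Clang there is no performance difference between the two

Using clang (3.2, trunk 166872) (-O2 on Linux), the binaries for the two cases are actually identical.

I'll come back to clang at the end of the post. But first, gcc 4.7.2:

There are already a lot of insights here, but I want to point out that the results of the calculations of calc1 and calc2 are not the same, due to inlining etc. Compare, for example, the sum of all results: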

float result=0;
for (int i = 0; i < 1e8; ++i) {
  result+=calc2([](float arg){ return arg * 0.5f; });
}

with calc2 becomes

1.71799e+10, time spent 0.14 sec

while with calc1 it becomes

6.6435e+10, time spent 5.772 sec

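That is a factor of ~40 in speed and a factor of ~4 in the values. The first is a much bigger difference than what the OP posted (using Visual Studio). Actually printing the value at the end is also a good idea, to prevent the compiler from removing code with no visible result (the as-if rule). Cassio Neri already said this in his answer. Note how different the results are: one should be careful when comparing speed factors of code that performs different calculations.

Also, to be fair, comparing various ways of repeatedly calculating f(3.3) is perhaps not that interesting. If the input is constant, it should not be in a loop. (That is easy for the optimizer to notice.)

If I add a user-supplied value argument to calc1 and calc2, the speed factor between calc1 and calc2 comes down from 40 to 5! With Visual Studio the difference is close to a factor of 2, and with clang there is no difference (see below).

Also, as multiplications are fast, talking about slow-down factors is often not that interesting. A more interesting question is: how small are your functions, and are these calls the bottleneck in a real program?

Clang:

Clang (I used 3.2) actually produced identical binaries when I switched between calc1 and calc2 in the example code (posted below). With the original example posted in the question, both are also identical but take no time at all (the loops are removed completely, as described above). With my modified example, with -O2:

Seconds to execute (best of 3):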

clang:        calc1:           1.4 seconds
clang:        calc2:           1.4 seconds (identical binary)

gcc 4.7.2:    calc1:           1.1 seconds
gcc 4.7.2:    calc2:           6.0 seconds

VS2012 CTPNov calc1:           0.8 seconds 
VS2012 CTPNov calc2:           2.0 seconds 

VS2015 (14.0.23.107) calc1:    1.1 seconds 
VS2015 (14.0.23.107) calc2:    1.5 seconds 

MinGW (4.7.2) calc1:           0.9 seconds
MinGW (4.7.2) calc2:          20.5 seconds

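The calculated results of all binaries are the same, and all tests were executed on the same machine. It would be interesting if someone with deeper clang or VS knowledge could comment on which optimizations may have been done.

My modified test code: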

#include <functional>
#include <chrono>
#include <iostream>

template <typename F>
float calc1(F f, float x) { 
  return 1.0f + 0.002*x+f(x*1.223) ; 
}

float calc2(std::function<float(float)> f,float x) { 
  return 1.0f + 0.002*x+f(x*1.223) ; 
}

int main() {
    using namespace std::chrono;

    const auto tp1 = high_resolution_clock::now();

    float result=0;
    for (int i = 0; i < 1e8; ++i) {
      result=calc1([](float arg){ 
          return arg * 0.5f; 
        },result);
    }
    const auto tp2 = high_resolution_clock::now();

    const auto d = duration_cast<milliseconds>(tp2 - tp1);  
    std::cout << d.count() << std::endl;
    std::cout << result<< std::endl;
    return 0;
}

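Update:

Added VS2015. I also noticed that there are double->float conversions in calc1 and calc2. Removing them does not change the conclusion for Visual Studio (both become a lot faster, but the ratio stays about the same).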

score 14 · Accepted Answer

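Different isn't the same.

It's slower because it does things that a template can't do. In particular, it lets you call, from the same code, any function that can be called with the given argument types and whose return type is convertible to the given return type.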

#include <functional>
#include <iostream>

void eval(const std::function<int(int)>& f) {
    std::cout << f(3);
}

int f1(int i) {
    return i;
}

float f2(double d) {
    return d;
}

int main() {
    std::function<int(int)> fun(f1);
    eval(fun);
    fun = f2;
    eval(fun);
    return 0;
}

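Note that the same function object, fun, is passed to both calls to eval. It holds two different functions.

If you don't need to do that, then you should not use std::function.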

score 8 · Accepted Answer

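You already have some good answers here, so I'm not going to contradict them. In short, comparing std::function to templates is like comparing virtual functions to functions. You should never "prefer" virtual functions to functions; rather, you use virtual functions when they fit the problem, moving decisions from compile time to run time. The idea is that, rather than solving the problem with a bespoke solution (like a jump table), you use something that gives the compiler a better chance of optimizing for you. Using a standard solution also helps other programmers.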

score 6 · Accepted Answer

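This answer is intended to contribute, to the set of existing answers, what I believe to be a more meaningful benchmark for the run-time cost of std::function calls.

The std::function mechanism should be recognized for what it provides: any callable entity can be converted to a std::function of appropriate signature. Suppose you have a library that fits a surface to a function defined by z = f(x,y). You can write it to accept a std::function<double(double,double)>, and the user of the library can easily convert any callable entity to that, be it an ordinary function, a method of a class instance, a lambda, or anything that is supported by std::bind.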

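Unlike template approaches, this works without having to recompile the library function for different cases; accordingly, little extra compiled code is needed for each additional case. It has always been possible to make this happen, but it used to require some awkward mechanisms, and the user of the library would likely have needed to construct an adapter around their function to make it work. std::function automatically constructs whatever adapter is needed to get a common run-time call interface for all the cases, which is a new and very powerful feature.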

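In my view, this is the most important use case for std::function as far as performance is concerned: I'm interested in the cost of calling a std::function many times after it has been constructed once, and it needs to be a situation where the compiler cannot optimize the call by knowing which function is actually being called (i.e. you need to hide the implementation in another source file to get a proper benchmark).

I ran the test below, similar to the OP's, but with these main changes:

- Each case loops 1 billion times, but the std::function objects are constructed just once. Looking at the generated code, I found that 'operator new' is called when the std::function objects are actually constructed (maybe not when they are optimized out).
- The test is split into two files to prevent undesired optimization.
- My cases are: (a) the function is inlined, (b) the function is passed by an ordinary function pointer, (c) the function is a compatible function wrapped as std::function, (d) the function is an incompatible function made compatible with std::bind, wrapped as std::function.

The results I get are:

case (a) (inline): 1.3 nsec
all other cases: 3.3 nsec.

Case (d) tends to be slightly slower, but the difference (about 0.05 nsec) is absorbed in the noise.

The conclusion is that std::function has overhead (at call time) comparable to using a function pointer, even when there is a simple 'bind' adaptation to the actual function. The inline case is 2 ns faster than the others, but that is an expected trade-off, since the inline case is the only one that is 'hard-wired' at run time.

When I run johan-lundberg's code on the same machine, I see about 39 nsec per loop, but there is a lot more in the loop there, including the actual constructor and destructor of the std::function, whose cost is probably fairly high since it involves a new and a delete.

-O2 gcc 4.8.1, to x86_64 target (core i5).

Note that the code is broken up into two files to prevent the compiler from expanding the functions where they are called (except in the one case where that is intended).

----- first source file --------------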

#include <functional>


// simple funct
float func_half( float x ) { return x * 0.5; }

// func we can bind
float mul_by( float x, float scale ) { return x * scale; }

//
// func to call another func a zillion times.
//
float test_stdfunc( std::function<float(float)> const & func, int nloops ) {
    float x = 1.0;
    float y = 0.0;
    for(int i =0; i < nloops; i++ ){
        y += x;
        x = func(x);
    }
    return y;
}

// same thing with a function pointer
float test_funcptr( float (*func)(float), int nloops ) {
    float x = 1.0;
    float y = 0.0;
    for(int i =0; i < nloops; i++ ){
        y += x;
        x = func(x);
    }
    return y;
}

// same thing with inline function
float test_inline(  int nloops ) {
    float x = 1.0;
    float y = 0.0;
    for(int i =0; i < nloops; i++ ){
        y += x;
        x = func_half(x);
    }
    return y;
}

----- second source file -------------

#include <iostream>
#include <functional>
#include <chrono>

extern float func_half( float x );
extern float mul_by( float x, float scale );
extern float test_inline(  int nloops );
extern float test_stdfunc( std::function<float(float)> const & func, int nloops );
extern float test_funcptr( float (*func)(float), int nloops );

int main() {
    using namespace std::chrono;


    for(int icase = 0; icase < 4; icase ++ ){
        const auto tp1 = high_resolution_clock::now();  // same clock as tp2

        float result;
        switch( icase ){
         case 0:
            result = test_inline( 1e9);
            break;
         case 1:
            result = test_funcptr( func_half, 1e9);
            break;
         case 2:
            result = test_stdfunc( func_half, 1e9);
            break;
         case 3:
            result = test_stdfunc( std::bind( mul_by, std::placeholders::_1, 0.5), 1e9);
            break;
        }
        const auto tp2 = high_resolution_clock::now();

        const auto d = duration_cast<milliseconds>(tp2 - tp1);  
        std::cout << d.count() << std::endl;
        std::cout << result<< std::endl;
    }
    return 0;
}

For those interested, here is the adaptor the compiler built to make 'mul_by' look like a float(float). This is what gets 'called' when the function created as bind(mul_by,_1,0.5) is called:

movq    (%rdi), %rax                ; get the std::func data
movsd   8(%rax), %xmm1              ; get the bound value (0.5)
movq    (%rax), %rdx                ; get the function to call (mul_by)
cvtpd2ps    %xmm1, %xmm1        ; convert 0.5 to 0.5f
jmp *%rdx                       ; jump to the func

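(So it might have been a bit faster if I'd written 0.5f in the bind...) Note that the 'x' parameter arrives in %xmm0 and just stays there.

Here is the code in the area where the function is constructed, prior to calling test_stdfunc, run through c++filt: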

movl    $16, %edi
movq    $0, 32(%rsp)
call    operator new(unsigned long)      ; get 16 bytes for std::function
movsd   .LC0(%rip), %xmm1                ; get 0.5
leaq    16(%rsp), %rdi                   ; (1st parm to test_stdfunc) 
movq    mul_by(float, float), (%rax)     ; store &mul_by  in std::function
movl    $1000000000, %esi                ; (2nd parm to test_stdfunc)
movsd   %xmm1, 8(%rax)                   ; store 0.5 in std::function
movq    %rax, 16(%rsp)                   ; save ptr to allocated mem

   ;; the next two ops store pointers to generated code related to the std::function.
   ;; the first one points to the adaptor I showed above.

movq    std::_Function_handler<float (float), std::_Bind<float (*(std::_Placeholder<1>, double))(float, float)> >::_M_invoke(std::_Any_data const&, float), 40(%rsp)
movq    std::_Function_base::_Base_manager<std::_Bind<float (*(std::_Placeholder<1>, double))(float, float)> >::_M_manager(std::_Any_data&, std::_Any_data const&, std::_Manager_operation), 32(%rsp)


call    test_stdfunc(std::function<float (float)> const&, int)

score 4 · Accepted Answer

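I found your results very interesting, so I did a bit of digging to understand what is going on. First off, as many others have said, unless the result of the computation affects the state of the program, the compiler will just optimize it away. Secondly, with a constant 3.3 given as the argument to the callback, I suspect that other optimizations are going on as well. With that in mind, I changed your benchmark code a little.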

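#include <chrono>
#include <functional>
#include <iostream>

template <typename F>
float calc1(F f, float i) { return -1.0f * f(i) + 666.0f; }
float calc2(std::function<float(float)> f, float i) { return -1.0f * f(i) + 666.0f; }

int main() {
    using namespace std::chrono;
    float t = 0.0f;  // accumulator; its declaration was missing from the posted snippet
    const auto tp1 = high_resolution_clock::now();
    for (int i = 0; i < 1e8; ++i) {
        t += calc2([&](float arg){ return arg * 0.5f + t; }, i);
    }
    const auto tp2 = high_resolution_clock::now();
    // Print t so the loop has a visible effect and is not optimized away.
    std::cout << duration_cast<milliseconds>(tp2 - tp1).count() << " ms, t = " << t << std::endl;
}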

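With this change, I compiled the code with gcc 4.8 -O3 and got a time of 330ms for calc1 and 2702ms for calc2. So using the template was 8 times faster. This number looked suspicious to me: a speed-up by a factor of 8 often indicates that the compiler has vectorized something. When I looked at the generated code for the template version, it was clearly vectorized: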

.L34:
cvtsi2ss        %edx, %xmm0
addl    $1, %edx
movaps  %xmm3, %xmm5
mulss   %xmm4, %xmm0
addss   %xmm1, %xmm0
subss   %xmm0, %xmm5
movaps  %xmm5, %xmm0
addss   %xmm1, %xmm0
cvtsi2sd        %edx, %xmm1
ucomisd %xmm1, %xmm2
ja      .L37
movss   %xmm0, 16(%rsp)

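whereas the std::function version was not. This makes sense to me: with the template, the compiler knows for sure that the function will never change throughout the loop, but the std::function being passed in could change, and therefore the loop cannot be vectorized.

This led me to try something else, to see whether I could get the compiler to perform the same optimization on the std::function version. Instead of passing in a function, I made a std::function a global variable and had that called.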

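#include <chrono>
#include <functional>
#include <iostream>

// The callable is a global, declared before calc3 uses it.
std::function<float(float)> f2 = [](float arg){ return arg * 0.5f; };
float calc3(float i) {  return -1.0f * f2(i) + 666.0f; }

int main() {
    using namespace std::chrono;
    float t = 0.0f;  // accumulator; its declaration was missing from the posted snippet
    const auto tp1 = high_resolution_clock::now();
    for (int i = 0; i < 1e8; ++i) {
        t += calc3(i);  // calc3 takes only the argument; it calls the global f2
    }
    const auto tp2 = high_resolution_clock::now();
    std::cout << duration_cast<milliseconds>(tp2 - tp1).count() << " ms, t = " << t << std::endl;
}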

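With this version, we see that the compiler has now vectorized the code in the same way, and I get the same benchmark result.

template: 330ms
std::function: 2702ms
global std::function: 330ms

So my conclusion is that the raw speed of a std::function and that of a template functor are pretty much the same. However, std::function makes the optimizer's job much more difficult.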

score 1 · Accepted Answer

If you use a template instead of std::function, in C++20 you can actually write your own concept for it with variadic templates (inspired by Hendrik Niemeyer's talk about C++20 concepts):

#include <concepts>
#include <type_traits>

template<class Func, typename Ret, typename... Args>
concept functor = std::regular_invocable<Func, Args...> && 
                  std::same_as<std::invoke_result_t<Func, Args...>, Ret>;

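You can then use it as functor<Ret, Args...> F, where Ret is the return type and Args... are the variadic input arguments. For example, a function template declared with functor<double,int> F, such as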

template <functor<double,int> F>
auto CalculateSomething(F&& f, int const arg) {
  return f(arg)*f(arg);
}

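requires, as its template argument, a functor that overloads the () operator, returns double, and takes a single input argument of type int. Similarly, functor<double> would be a functor with a double return type that takes no input arguments.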

You can also use it with variadic functions, such as

template <typename... Args, functor<double, Args...> F>
auto CalculateSomething(F&& f, Args... args) {
  return f(args...)*f(args...);
}
