c++ - Inlining of variadic functions

Question

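While playing around with optimisation settings, I noticed an interesting phenomenon: functions taking a variable number of arguments (...) never seem to get inlined. (Obviously this behaviour is compiler-specific, but I have tested on a couple of different systems.)

For example, compiling the following small program: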

#include <stdarg.h>
#include <stdio.h>

static inline void test(const char *format, ...)
{
  va_list ap;
  va_start(ap, format);
  vprintf(format, ap);
  va_end(ap);
}

int main()
{
  test("Hello %s\n", "world");
  return 0;
}

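seemingly always results in a (possibly mangled) test symbol appearing in the resulting executable (tested with Clang and GCC, in both C and C++ modes, on MacOS and Linux). If one modifies the signature of test() to take a plain string which is passed to printf(), both compilers inline the function from -O1 upwards, as you would expect.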
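For comparison, the non-variadic variant described above might look like the following minimal sketch. The name test_plain and the int return value (the character count reported by printf) are illustrative additions and are not part of the original program.

```c
#include <stdio.h>

/* Non-variadic counterpart of test(): the string is forwarded to printf()
 * as an ordinary argument, so no va_list has to be built.
 * test_plain is a hypothetical name used only for this sketch. */
static inline int test_plain(const char *s)
{
  /* printf returns the number of characters it wrote. */
  return printf("Hello %s\n", s);
}
```

Calling test_plain("world") prints the same line as the original program; whether the call is actually inlined still depends on the compiler and optimisation level.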

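I suspect this has to do with the voodoo magic used to implement varargs, but how exactly this is usually done is a mystery to me. Can anybody enlighten me as to how compilers typically implement variadic functions, and why this seemingly prevents inlining?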

score 11 · Accepted Answer

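At least on x86-64, the passing of var_args is quite complex (because arguments are passed in registers). Other architectures may not be quite so complex, but it is rarely trivial. In particular, a stack frame or frame pointer may be needed to refer to when fetching each argument. Rules of this sort may well stop the compiler from inlining the function.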

The code for x86-64 includes spilling all the integer argument registers and 8 SSE registers onto the stack.
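To make the register spilling more concrete, here is a minimal sketch of the bookkeeping that the System V AMD64 ABI describes for va_list. The struct and function names (sysv_va_list, reg_save_area_size, sketch_va_start) are hypothetical mirrors written for illustration; real code should only ever go through <stdarg.h>. With one named integer parameter (the format string), the sketch yields offsets 8 and 48, which are the two constants stored by the x86-64 listings in this answer.

```c
/* Mirror of the va_list element described by the System V AMD64 ABI.
 * Illustrative only: real code must use <stdarg.h>, never this struct. */
struct sysv_va_list {
  unsigned int gp_offset;   /* offset of the next unread integer register slot */
  unsigned int fp_offset;   /* offset of the next unread SSE register slot */
  void *overflow_arg_area;  /* next argument that was passed on the stack */
  void *reg_save_area;      /* where the prologue spilled the registers */
};

enum {
  GP_REGS = 6,   /* rdi, rsi, rdx, rcx, r8, r9 */
  GP_SIZE = 8,   /* bytes per integer register slot */
  FP_REGS = 8,   /* xmm0 .. xmm7 */
  FP_SIZE = 16   /* bytes per SSE register slot */
};

/* Total size of the register save area: 6*8 + 8*16 bytes. */
unsigned int reg_save_area_size(void)
{
  return GP_REGS * GP_SIZE + FP_REGS * FP_SIZE;
}

/* State that va_start would set up for a function with the given number of
 * named integer and floating-point parameters, assuming all of the named
 * parameters were passed in registers. */
struct sysv_va_list sketch_va_start(unsigned int named_gp, unsigned int named_fp,
                                    void *stack_args, void *save_area)
{
  struct sysv_va_list ap;
  ap.gp_offset = named_gp * GP_SIZE;
  ap.fp_offset = GP_REGS * GP_SIZE + named_fp * FP_SIZE;
  ap.overflow_arg_area = stack_args;
  ap.reg_save_area = save_area;
  return ap;
}
```

Each va_arg then either reads from reg_save_area at the current offset and bumps that offset, or falls back to overflow_arg_area once the register slots are used up.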

This is the function from the original code, compiled with Clang:

test:                                   # @test
    subq    $200, %rsp
    testb   %al, %al
    je  .LBB1_2
# BB#1:                                 # %entry
    movaps  %xmm0, 48(%rsp)
    movaps  %xmm1, 64(%rsp)
    movaps  %xmm2, 80(%rsp)
    movaps  %xmm3, 96(%rsp)
    movaps  %xmm4, 112(%rsp)
    movaps  %xmm5, 128(%rsp)
    movaps  %xmm6, 144(%rsp)
    movaps  %xmm7, 160(%rsp)
.LBB1_2:                                # %entry
    movq    %r9, 40(%rsp)
    movq    %r8, 32(%rsp)
    movq    %rcx, 24(%rsp)
    movq    %rdx, 16(%rsp)
    movq    %rsi, 8(%rsp)
    leaq    (%rsp), %rax
    movq    %rax, 192(%rsp)
    leaq    208(%rsp), %rax
    movq    %rax, 184(%rsp)
    movl    $48, 180(%rsp)
    movl    $8, 176(%rsp)
    movq    stdout(%rip), %rdi
    leaq    176(%rsp), %rdx
    movl    $.L.str, %esi
    callq   vfprintf
    addq    $200, %rsp
    retq

And from GCC:

test.constprop.0:
    .cfi_startproc
    subq    $216, %rsp
    .cfi_def_cfa_offset 224
    testb   %al, %al
    movq    %rsi, 40(%rsp)
    movq    %rdx, 48(%rsp)
    movq    %rcx, 56(%rsp)
    movq    %r8, 64(%rsp)
    movq    %r9, 72(%rsp)
    je  .L2
    movaps  %xmm0, 80(%rsp)
    movaps  %xmm1, 96(%rsp)
    movaps  %xmm2, 112(%rsp)
    movaps  %xmm3, 128(%rsp)
    movaps  %xmm4, 144(%rsp)
    movaps  %xmm5, 160(%rsp)
    movaps  %xmm6, 176(%rsp)
    movaps  %xmm7, 192(%rsp)
.L2:
    leaq    224(%rsp), %rax
    leaq    8(%rsp), %rdx
    movl    $.LC0, %esi
    movq    stdout(%rip), %rdi
    movq    %rax, 16(%rsp)
    leaq    32(%rsp), %rax
    movl    $8, 8(%rsp)
    movl    $48, 12(%rsp)
    movq    %rax, 24(%rsp)
    call    vfprintf
    addq    $216, %rsp
    .cfi_def_cfa_offset 8
    ret
    .cfi_endproc

With Clang targeting 32-bit x86, it is much simpler:

test:                                   # @test
    subl    $28, %esp
    leal    36(%esp), %eax
    movl    %eax, 24(%esp)
    movl    stdout, %ecx
    movl    %eax, 8(%esp)
    movl    %ecx, (%esp)
    movl    $.L.str, 4(%esp)
    calll   vfprintf
    addl    $28, %esp
    retl

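There is nothing really stopping any of the above code from being inlined as such, so it would appear that this is simply a policy decision by the compiler writers. Of course, for a call to something like printf, it is pretty meaningless to optimise away a call/return pair at the cost of the code expansion. After all, printf is NOT a small, short function.

(A decent part of my work for most of the past year has been implementing printf in an OpenCL environment, so I know far more about format specifiers and the various other tricky parts of printf than most people will ever even look up.)

Edit: The OpenCL compiler we use WILL inline calls to var_args functions, so it is possible to implement such a thing. It won't do so for calls to printf, because that bloats the code very much. By default, though, our compiler inlines EVERYTHING, all the time, no matter what it is... and it does work, but we found that having 2-3 copies of printf in the code makes it REALLY huge (with all sorts of other drawbacks, including final code generation taking a lot longer because of some bad choices of algorithm in the compiler backend), so we had to add code to STOP the compiler doing that...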

score 5 · Accepted Answer

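A variable-argument implementation generally follows this algorithm: take the first address on the stack after the format string and, while parsing the format string, interpret the value at that position as the required data type. Then advance the stack-parsing pointer by the size of that data type, move on in the format string, and interpret the value at the new position as the next required data type... and so on.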
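The pointer-walking idea described above can be sketched without touching a real stack. In the following illustration a byte buffer stands in for the argument area, and the names arg_cursor, next_int, next_double, sum_by_format and demo_walk are all hypothetical. It also assumes a simplified format string in which 'd' means int and 'f' means double, and it ignores the alignment padding and register passing that real ABIs add.

```c
#include <string.h>

/* Cursor over a packed argument area (a byte buffer standing in for the
 * stack). Real ABIs add alignment padding and may use registers; this
 * sketch ignores both. */
struct arg_cursor {
  const unsigned char *next;  /* address of the next unread argument */
};

/* Read an int at the cursor, then advance by sizeof(int). */
int next_int(struct arg_cursor *c)
{
  int v;
  memcpy(&v, c->next, sizeof v);
  c->next += sizeof v;
  return v;
}

/* Read a double at the cursor, then advance by sizeof(double). */
double next_double(struct arg_cursor *c)
{
  double v;
  memcpy(&v, c->next, sizeof v);
  c->next += sizeof v;
  return v;
}

/* Walk a simplified format string ('d' = int, 'f' = double) and add up
 * every argument found in the buffer. */
double sum_by_format(const char *format, const unsigned char *args)
{
  struct arg_cursor c = { args };
  double total = 0.0;
  for (; *format != '\0'; ++format) {
    if (*format == 'd')
      total += next_int(&c);
    else if (*format == 'f')
      total += next_double(&c);
  }
  return total;
}

/* Usage: pack 3, 1.5 and 4 back to back, then walk them with "dfd". */
double demo_walk(void)
{
  unsigned char area[2 * sizeof(int) + sizeof(double)];
  int a = 3, b = 4;
  double x = 1.5;
  memcpy(area, &a, sizeof a);
  memcpy(area + sizeof a, &x, sizeof x);
  memcpy(area + sizeof a + sizeof x, &b, sizeof b);
  return sum_by_format("dfd", area);
}
```

The format string is the only source of type information here: if it disagrees with what was actually packed, the cursor reads the wrong bytes, which is the same failure mode as a mismatched printf format.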

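Some values are automatically converted (that is, promoted) to "larger" types: for example, char or short is promoted to int, and float to double. The exact result type can depend on the implementation, for instance when int cannot represent every value of the narrower type.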
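A small sketch of what these promotions mean for the callee: the arguments below are written as char, short and float at the call site, but they have to be read back as int, int and double. The function name sum_promoted and its n_triples parameter are made up for this example.

```c
#include <stdarg.h>

/* Sums n_triples groups of (char, short, float) passed through "...".
 * Because of the promotions applied to variadic arguments, each group
 * must be read back as (int, int, double). */
double sum_promoted(int n_triples, ...)
{
  va_list ap;
  double total = 0.0;
  va_start(ap, n_triples);
  for (int i = 0; i < n_triples; ++i) {
    total += va_arg(ap, int);     /* was a char at the call site */
    total += va_arg(ap, int);     /* was a short at the call site */
    total += va_arg(ap, double);  /* was a float at the call site */
  }
  va_end(ap);
  return total;
}
```

Writing va_arg(ap, float) or va_arg(ap, char) instead would be undefined behaviour, since no argument of those types ever arrives through the ellipsis.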

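Of course, you do not need a format string, but in that case you need to know the types of the arguments passed in some other way (for example: all ints, or all doubles, or first 3 ints and then 3 doubles...).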
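As a sketch of the "no format string" case, the following function relies on a fixed convention instead: a given number of ints followed by a given number of doubles. The name sum_mixed and its two count parameters are assumptions chosen for this example.

```c
#include <stdarg.h>

/* Sums n_ints int arguments followed by n_doubles double arguments.
 * There is no format string: caller and callee simply agree on the order
 * and types of the variadic arguments. */
double sum_mixed(int n_ints, int n_doubles, ...)
{
  va_list ap;
  double total = 0.0;
  va_start(ap, n_doubles);  /* n_doubles is the last named parameter */
  for (int i = 0; i < n_ints; ++i)
    total += va_arg(ap, int);
  for (int i = 0; i < n_doubles; ++i)
    total += va_arg(ap, double);
  va_end(ap);
  return total;
}
```

For example, sum_mixed(3, 3, 1, 2, 3, 0.5, 0.25, 0.25) reads three ints and then three doubles. Nothing checks that the caller keeps its side of the agreement.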

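So much for the short theory.

Now, for the practice: as the comment from nm above shows, gcc does not inline functions that handle variable arguments. Handling the variable arguments may involve fairly complex operations that would increase the size of the code beyond what is optimal, so it is simply not worth inlining these functions.

EDIT:

After a quick test with VS2012, I don't seem to be able to convince the compiler to inline the function with variable arguments. Regardless of the combination of flags in the "Optimization" tab of the project, there is always a call to test and there is always a test function. And indeed: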

http://msdn.microsoft.com/en-us/library/z8y1yy88.aspx

says

Even with __forceinline, the compiler cannot inline code in all circumstances. The compiler cannot inline a function if: ...

The function has a variable argument list.

score 1 · Accepted Answer

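The point of inlining is that it reduces function call overhead.

But for varargs, there is generally very little to be gained.
Consider this code in the body of such a function: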

if (blah)
{
    printf("%d", va_arg(vl, int));
}
else
{
    printf("%s", va_arg(vl, char *));
}

How is the compiler supposed to inline it? Doing that requires the compiler to push everything on the stack in the correct order anyway, even though there isn't any function being called. The only thing that's optimized away is a call/ret instruction pair (and maybe pushing/popping ebp and whatnot). The memory manipulations cannot be optimized away, and the parameters cannot be passed in registers. So it's unlikely that you'll gain anything notable by inlining varargs.
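For reference, a complete function built around the fragment above might look like the following sketch. The name format_one, the as_int flag (standing in for blah) and the buffer-based interface using snprintf instead of printf are assumptions made so that the example is self-contained.

```c
#include <stdarg.h>
#include <stdio.h>

/* Formats the first variadic argument either as an int or as a string,
 * depending on a flag that is only known at run time. Returns the number
 * of characters the formatted text needs, as reported by snprintf. */
int format_one(char *out, size_t size, int as_int, ...)
{
  va_list vl;
  int written;
  va_start(vl, as_int);
  if (as_int)
    written = snprintf(out, size, "%d", va_arg(vl, int));
  else
    written = snprintf(out, size, "%s", va_arg(vl, char *));
  va_end(vl);
  return written;
}
```

The type used to read the argument depends on as_int, so unless the flag is a compile-time constant at the call site, the choice between int and char * can only be made at run time.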

score 1 · Accepted Answer

I do not expect that it would ever be possible to inline a varargs function, except in the most trivial case.

A varargs function that had no arguments, or that did not access any of its arguments, or that accessed only the fixed arguments preceding the variable ones could be inlined by rewriting it as an equivalent function that did not use varargs. This is the trivial case.
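A minimal sketch of this trivial case, with hypothetical names: the first function is variadic only in its signature, and the second is the equivalent non-variadic function it could be rewritten as.

```c
/* Variadic in its signature, but the body never touches "...". */
int add_fixed(int a, int b, ...)
{
  return a + b;
}

/* The equivalent function without varargs: same body, no "...". */
int add_fixed_plain(int a, int b)
{
  return a + b;
}
```

Any extra arguments passed to add_fixed are evaluated by the caller and then ignored, so both functions return the same result for the same fixed arguments.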

A varargs function that accesses its variadic arguments does so by executing code generated by the va_start and va_arg macros, which rely on the arguments being laid out in memory in some particular way. A compiler that performed inlining simply to remove the overhead of a function call would still need to create the data structure that supports those macros. A compiler that attempted to remove all the machinery of a function call would have to analyse and optimise away those macros as well. And it would still fail if the variadic function called another function and passed it a va_list as an argument.

I do not see a feasible path for this second case.
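The last obstacle mentioned above, passing the va_list on to another function, is the common wrapper pattern sketched here. The name format_into is hypothetical; vsnprintf is the standard C library function.

```c
#include <stdarg.h>
#include <stdio.h>

/* Forwards its variadic arguments to vsnprintf through a va_list.
 * The va_list object is handed to another function, so it has to exist
 * in whatever form that function expects. */
int format_into(char *out, size_t size, const char *format, ...)
{
  va_list ap;
  int written;
  va_start(ap, format);
  written = vsnprintf(out, size, format, ap);
  va_end(ap);
  return written;
}
```

This has the same shape as the test() function in the question, with vsnprintf in place of vprintf.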
