c++ - what is the performance impact of using int64_t instead of int32_t on 32-bit systems?

Question

Our C++ library currently uses time_t for storing time values. I'm beginning to need sub-second precision in some places, so a larger data type will be necessary there anyway. Also, it might be useful to get around the Year-2038 problem in some places. So I'm thinking about completely switching to a single Time class with an underlying int64_t value, to replace the time_t value in all places.

Now I'm wondering about the performance impact of such a change when running this code on a 32-bit operating system or 32-bit CPU. IIUC the compiler will generate code to perform 64-bit arithmetic using 32-bit registers. But if this is too slow, I might have to use a more differentiated way for dealing with time values, which might make the software more difficult to maintain.

What I'm interested in:

which factors influence performance of these operations? Probably the compiler and compiler version; but does the operating system or the CPU make/model influence this as well? Will a normal 32-bit system use the 64-bit registers of modern CPUs?
which operations will be especially slow when emulated on 32-bit? Or which will have nearly no slowdown?
are there any existing benchmark results for using int64_t/uint64_t on 32-bit systems?
does anyone have own experience about this performance impact?

I'm mostly interested in g++ 4.1 and 4.4 on Linux 2.6 (RHEL5, RHEL6) on Intel Core 2 systems; but it would also be nice to know about the situation for other systems (like Sparc Solaris + Solaris CC, Windows + MSVC).

score 47 · Accepted Answer

哪些因素会影响这些操作的性能？可能是编译器和编译器版本；但是操作系统或 CPU 品牌/型号是否也会影响这一点？

主要是处理器架构（和模型 - 请阅读我在本节中提到处理器架构的模型）。编译器可能会有一些影响，但是大多数编译器在这方面做得很好，所以处理器架构会比编译器产生更大的影响。

操作系统不会有任何影响（除了“如果你改变操作系统，你需要使用不同类型的编译器来改变编译器的功能”在某些情况下——但这可能是一个很小的影响）。

普通的 32 位系统会使用现代 CPU 的 64 位寄存器吗？

这是不可能的。如果系统处于 32 位模式，它将充当 32 位系统，寄存器的额外 32 位是完全不可见的，就像系统实际上是“真正的 32 位系统”一样.

在 32 位上模拟时，哪些操作会特别慢？或者哪个几乎不会放缓？

加法和减法更糟糕，因为它们必须按两个操作的顺序完成，而第二个操作需要第一个操作完成 - 如果编译器只是对独立数据产生两个加法操作，则情况并非如此。

如果输入参数实际上是 64 位，乘法会变得更糟——例如，2^35 * 83 比 2^31 * 2^31 差。这是因为处理器可以很好地产生 32 x 32 位乘法到 64 位结果 - 大约 5-10 个时钟周期。但是 64 x 64 位乘法需要相当多的额外代码，因此需要更长的时间。

除法是与乘法类似的问题——但这里可以在一侧取一个 64 位输入，将其除以一个 32 位值并得到一个 32 位值。由于很难预测何时会起作用，因此 64 位除法可能几乎总是很慢。

数据也将占用两倍的缓存空间，这可能会影响结果。并且作为类似的结果，一般分配和传递数据将花费最少两倍的时间，因为要操作的数据量是两倍。

编译器还需要使用更多的寄存器。

在 32 位系统上使用 int64_t/uint64_t 是否有任何现有的基准测试结果？

可能，但我不知道。即使有，它也只会对您有点意义，因为操作的组合对操作速度非常关键。

如果性能是您的应用程序的重要组成部分，那么对您的代码（或其中的一些代表性部分）进行基准测试。如果您的代码在相同情况下速度慢或快一些完全不同的量，Benchmark X 是否给出慢 5%、25% 或 103% 的结果并不重要。

有人对这种性能影响有自己的经验吗？

我已经为 64 位架构重新编译了一些使用 64 位整数的代码，并发现性能得到了相当大的提升——在某些代码位上提高了 25%。

将您的操作系统更改为同一操作系统的 64 位版本，也许会有所帮助？

编辑：

因为我喜欢找出这类事情的不同之处，所以我编写了一些代码，并使用了一些原始模板（仍在学习那部分 - 模板并不是我最热门的话题，我必须说 - 给我位摆弄和指针算术，我（通常）会做对......）

这是我编写的代码，试图复制一些常见的功能：

#include <iostream>
#include <cstdint>
#include <ctime>

using namespace std;

static __inline__ uint64_t rdtsc(void)
{
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ( (uint64_t)lo)|( ((uint64_t)hi)<<32 );
}

template<typename T>
static T add_numbers(const T *v, const int size)
{
    T sum = 0;
    for(int i = 0; i < size; i++)
    sum += v[i];
    return sum;
}


template<typename T, const int size>
static T add_matrix(const T v[size][size])
{
    T sum[size] = {};
    for(int i = 0; i < size; i++)
    {
    for(int j = 0; j < size; j++)
        sum[i] += v[i][j];
    }
    T tsum=0;
    for(int i = 0; i < size; i++)
    tsum += sum[i];
    return tsum;
}



template<typename T>
static T add_mul_numbers(const T *v, const T mul, const int size)
{
    T sum = 0;
    for(int i = 0; i < size; i++)
    sum += v[i] * mul;
    return sum;
}

template<typename T>
static T add_div_numbers(const T *v, const T mul, const int size)
{
    T sum = 0;
    for(int i = 0; i < size; i++)
    sum += v[i] / mul;
    return sum;
}


template<typename T> 
void fill_array(T *v, const int size)
{
    for(int i = 0; i < size; i++)
    v[i] = i;
}

template<typename T, const int size> 
void fill_array(T v[size][size])
{
    for(int i = 0; i < size; i++)
    for(int j = 0; j < size; j++)
        v[i][j] = i + size * j;
}




uint32_t bench_add_numbers(const uint32_t v[], const int size)
{
    uint32_t res = add_numbers(v, size);
    return res;
}

uint64_t bench_add_numbers(const uint64_t v[], const int size)
{
    uint64_t res = add_numbers(v, size);
    return res;
}

uint32_t bench_add_mul_numbers(const uint32_t v[], const int size)
{
    const uint32_t c = 7;
    uint32_t res = add_mul_numbers(v, c, size);
    return res;
}

uint64_t bench_add_mul_numbers(const uint64_t v[], const int size)
{
    const uint64_t c = 7;
    uint64_t res = add_mul_numbers(v, c, size);
    return res;
}

uint32_t bench_add_div_numbers(const uint32_t v[], const int size)
{
    const uint32_t c = 7;
    uint32_t res = add_div_numbers(v, c, size);
    return res;
}

uint64_t bench_add_div_numbers(const uint64_t v[], const int size)
{
    const uint64_t c = 7;
    uint64_t res = add_div_numbers(v, c, size);
    return res;
}


template<const int size>
uint32_t bench_matrix(const uint32_t v[size][size])
{
    uint32_t res = add_matrix(v);
    return res;
}
template<const int size>
uint64_t bench_matrix(const uint64_t v[size][size])
{
    uint64_t res = add_matrix(v);
    return res;
}


template<typename T>
void runbench(T (*func)(const T *v, const int size), const char *name, T *v, const int size)
{
    fill_array(v, size);

    uint64_t long t = rdtsc();
    T res = func(v, size);
    t = rdtsc() - t;
    cout << "result = " << res << endl;
    cout << name << " time in clocks " << dec << t  << endl;
}

template<typename T, const int size>
void runbench2(T (*func)(const T v[size][size]), const char *name, T v[size][size])
{
    fill_array(v);

    uint64_t long t = rdtsc();
    T res = func(v);
    t = rdtsc() - t;
    cout << "result = " << res << endl;
    cout << name << " time in clocks " << dec << t  << endl;
}


int main()
{
    // spin up CPU to full speed...
    time_t t = time(NULL);
    while(t == time(NULL)) ;

    const int vsize=10000;

    uint32_t v32[vsize];
    uint64_t v64[vsize];

    uint32_t m32[100][100];
    uint64_t m64[100][100];


    runbench(bench_add_numbers, "Add 32", v32, vsize);
    runbench(bench_add_numbers, "Add 64", v64, vsize);

    runbench(bench_add_mul_numbers, "Add Mul 32", v32, vsize);
    runbench(bench_add_mul_numbers, "Add Mul 64", v64, vsize);

    runbench(bench_add_div_numbers, "Add Div 32", v32, vsize);
    runbench(bench_add_div_numbers, "Add Div 64", v64, vsize);

    runbench2(bench_matrix, "Matrix 32", m32);
    runbench2(bench_matrix, "Matrix 64", m64);
}

编译：

g++ -Wall -m32 -O3 -o 32vs64 32vs64.cpp -std=c++0x

结果是：注意：请参阅下面的 2016 年结果- 由于 64 位模式下 SSE 指令的使用存在差异，这些结果略微乐观，但 32 位模式下没有 SSE 使用。

result = 49995000
Add 32 time in clocks 20784
result = 49995000
Add 64 time in clocks 30358
result = 349965000
Add Mul 32 time in clocks 30182
result = 349965000
Add Mul 64 time in clocks 79081
result = 7137858
Add Div 32 time in clocks 60167
result = 7137858
Add Div 64 time in clocks 457116
result = 49995000
Matrix 32 time in clocks 22831
result = 49995000
Matrix 64 time in clocks 23823

如您所见，加法和乘法并没有那么糟糕。部门变得非常糟糕。有趣的是，矩阵加法根本没有太大区别。

并且在 64 位上它更快吗？我听到你们中的一些人问：使用相同的编译器选项，只需 -m64 而不是 -m32 - yupp，更快：

result = 49995000
Add 32 time in clocks 8366
result = 49995000
Add 64 time in clocks 16188
result = 349965000
Add Mul 32 time in clocks 15943
result = 349965000
Add Mul 64 time in clocks 35828
result = 7137858
Add Div 32 time in clocks 50176
result = 7137858
Add Div 64 time in clocks 50472
result = 49995000
Matrix 32 time in clocks 12294
result = 49995000
Matrix 64 time in clocks 14733

编辑，2016 年更新：在 32 位和 64 位编译器模式下的四个变体，带和不带 SSE。

这些天我通常使用 clang++ 作为我常用的编译器。我尝试使用 g++ 进行编译（但它仍然是与上面不同的版本，因为我已经更新了我的机器 - 而且我也有不同的 CPU）。由于 g++ 无法在 64 位中编译 no-sse 版本，因此我没有看到其中的意义。（无论如何，g++ 给出了类似的结果）

作为一个短表：

Test name      | no-sse 32 | no-sse 64 | sse 32 | sse 64 |
----------------------------------------------------------
Add uint32_t   |   20837   |   10221   |   3701 |   3017 |
----------------------------------------------------------
Add uint64_t   |   18633   |   11270   |   9328 |   9180 |
----------------------------------------------------------
Add Mul 32     |   26785   |   18342   |  11510 |  11562 |
----------------------------------------------------------
Add Mul 64     |   44701   |   17693   |  29213 |  16159 |
----------------------------------------------------------
Add Div 32     |   44570   |   47695   |  17713 |  17523 |
----------------------------------------------------------
Add Div 64     |  405258   |   52875   | 405150 |  47043 |
----------------------------------------------------------
Matrix 32      |   41470   |   15811   |  21542 |   8622 |
----------------------------------------------------------
Matrix 64      |   22184   |   15168   |  13757 |  12448 |

带有编译选项的完整结果。

$ clang++ -m32 -mno-sse 32vs64.cpp --std=c++11 -O2
$ ./a.out
result = 49995000
Add 32 time in clocks 20837
result = 49995000
Add 64 time in clocks 18633
result = 349965000
Add Mul 32 time in clocks 26785
result = 349965000
Add Mul 64 time in clocks 44701
result = 7137858
Add Div 32 time in clocks 44570
result = 7137858
Add Div 64 time in clocks 405258
result = 49995000
Matrix 32 time in clocks 41470
result = 49995000
Matrix 64 time in clocks 22184

$ clang++ -m32 -msse 32vs64.cpp --std=c++11 -O2
$ ./a.out
result = 49995000
Add 32 time in clocks 3701
result = 49995000
Add 64 time in clocks 9328
result = 349965000
Add Mul 32 time in clocks 11510
result = 349965000
Add Mul 64 time in clocks 29213
result = 7137858
Add Div 32 time in clocks 17713
result = 7137858
Add Div 64 time in clocks 405150
result = 49995000
Matrix 32 time in clocks 21542
result = 49995000
Matrix 64 time in clocks 13757


$ clang++ -m64 -msse 32vs64.cpp --std=c++11 -O2
$ ./a.out
result = 49995000
Add 32 time in clocks 3017
result = 49995000
Add 64 time in clocks 9180
result = 349965000
Add Mul 32 time in clocks 11562
result = 349965000
Add Mul 64 time in clocks 16159
result = 7137858
Add Div 32 time in clocks 17523
result = 7137858
Add Div 64 time in clocks 47043
result = 49995000
Matrix 32 time in clocks 8622
result = 49995000
Matrix 64 time in clocks 12448


$ clang++ -m64 -mno-sse 32vs64.cpp --std=c++11 -O2
$ ./a.out
result = 49995000
Add 32 time in clocks 10221
result = 49995000
Add 64 time in clocks 11270
result = 349965000
Add Mul 32 time in clocks 18342
result = 349965000
Add Mul 64 time in clocks 17693
result = 7137858
Add Div 32 time in clocks 47695
result = 7137858
Add Div 64 time in clocks 52875
result = 49995000
Matrix 32 time in clocks 15811
result = 49995000
Matrix 64 time in clocks 15168

score 10 · Accepted Answer

关于在 32 位模式下做 64 位数学的知识比你想知道的还要多……

当您在 32 位模式下使用 64 位数字时（即使在 64 位 CPU 上，如果代码被编译为 32 位），它们被存储为两个单独的 32 位数字，一个存储数字的高位，并且另一个存储低位。其影响取决于指令。（tl；博士 - 通常，在 32 位 CPU 上进行 64 位数学运算理论上慢 2 倍，只要您不除/模，但实际上差异会更小（1.3 倍是我的猜测），因为通常程序不只是对 64 位整数进行数学运算，而且由于流水线，程序中的差异可能要小得多）。

加法/减法

许多架构支持所谓的进位标志。当加法结果溢出或减法结果没有下溢时设置。这些位的行为可以通过长加法和长减法来显示。此示例中的 C 显示比最高可表示位高的位（操作期间）或进位标志（操作后）。

  C 7 6 5 4 3 2 1 0      C 7 6 5 4 3 2 1 0
  0 1 1 1 1 1 1 1 1      1 0 0 0 0 0 0 0 0
+   0 0 0 0 0 0 0 1    -   0 0 0 0 0 0 0 1
= 1 0 0 0 0 0 0 0 0    = 0 1 1 1 1 1 1 1 1

为什么进位标志是相关的？好吧，CPU 通常有两个单独的加法和减法运算。在 x86 中，加法运算称为addand adc。add代表加法，而adc带进位的加法。它们之间的区别在于adc考虑进位，如果它被设置，它会在结果中加一。

类似地，如果进位位未设置，带进位的减法会从结果中减去 1。

这种行为允许轻松地对整数实现任意大小的加法和减法。x和y相加的结果（假设它们是 8 位）永远不会大于0x1FE. 如果你添加1，你会得到0x1FF。因此，9 位足以表示任何 8 位加法的结果。如果您从开始添加add，然后添加超出初始位的任何位adc，您可以对您喜欢的任何大小的数据进行添加。

在 32 位 CPU 上添加两个 64 位值如下。

将b的前 32 位添加到a的前 32 位。
将b的后 32位与a的后 32 位相加。

类比减法。

这给出了 2 条指令，但是，由于指令流水线，它可能比这慢，因为一个计算取决于另一个计算来完成，所以如果 CPU 除了 64 位加法之外没有其他事情要做，CPU 可能会等待进行第一次添加。

乘法

它发生在 x86 上，imul并且mul可以以溢出存储在edx寄存器中的方式使用。因此，将两个 32 位值相乘得到 64 位值非常容易。这样的乘法是一条指令，但要使用它，必须将乘法值之一存储在eax中。

无论如何，对于两个 64 位值相乘的更一般情况，可以使用以下公式计算它们（假设函数r删除超过 32 位的位）。

首先，很容易注意到结果的低 32 位将是被乘变量的低 32 位相乘。这是由于同余关系。

a ₁ ≡ b ₁ (mod n )
a ₂ ≡ b ₂ (mod n )
a ₁a ₂ ≡ b ₁b ₂ (mod n )

因此，该任务仅限于确定高 32 位。要计算结果的高 32 位，应将以下值相加。

两个低 32 位的高 32 位相乘（溢出，CPU 可以存储在edx中）
第一个变量的高 32 位乘以第二个变量的低 32 位
第一个变量的低 32 位乘以第二个变量的高 32 位

这提供了大约 5 条指令，但是由于 x86 中的寄存器数量相对有限（忽略对架构的扩展），它们不能充分利用流水线。如果您想提高乘法速度，请启用 SSE，因为这会增加寄存器的数量。

除法/模数（两者在实现上相似）

我不知道它是如何工作的，但它比加法、减法甚至乘法复杂得多。然而，它可能比 64 位 CPU 上的除法慢十倍。如果您能理解它，请查看“计算机编程艺术，第 2 卷：半数值算法”，第 257 页了解更多详细信息（不幸的是，我无法解释它）。

如果您除以 2 的幂，请参阅移位部分，因为这基本上是编译器可以优化除法的（加上在移位有符号数之前添加最高有效位）。

或/与/异或

考虑到这些操作是单位操作，这里没有什么特别的事情发生，只是按位操作执行了两次。

向左/向右移动

有趣的是，x86 实际上有一条指令来执行 64 位左移指令shld，它不是用零替换值的最低有效位，而是用不同寄存器的最高有效位替换它们。shrd同样，带指令的右移也是如此。这很容易使 64 位移位成为两个指令操作。

但是，这只是不断变化的情况。当移位不是恒定的时，事情会变得更加棘手，因为 x86 架构仅支持以 0-31 作为值的移位。除此之外的任何内容都根据未定义的官方文档，实际上，对值执行按位和 0x1F 操作。因此，当移位值高于 31 时，值存储之一将被完全擦除（对于左移，这是低字节，对于右移，这是高字节）。另一个获取寄存器中被擦除的值，然后执行移位操作。结果，这取决于分支预测器来做出良好的预测，并且速度有点慢，因为需要检查一个值。

__builtin_popcount[ll]

__builtin_popcount（较低）+ __builtin_popcount（较高）

其他内建

我懒得在这一点上完成答案。有人甚至使用这些吗？

未签名与已签名

加法，减法，乘法，或，和，异或，左移生成完全相同的代码。右移仅使用稍微不同的代码（算术移位与逻辑移位），但在结构上是相同的。然而，除法可能会生成不同的代码，并且有符号除法可能比无符号除法慢。

基准

基准？它们大多没有意义，因为当您不经常重复相同的操作时，指令流水线通常会导致事情变得更快。随意考虑除法很慢，但实际上并没有什么，当您超出基准测试时，您可能会注意到由于流水线，在 32 位 CPU 上执行 64 位操作一点也不慢。

对您自己的应用程序进行基准测试，不要相信与您的应用程序不兼容的微基准测试。现代 CPU 非常棘手，因此不相关的基准测试可以而且将会撒谎。

score 2 · Accepted Answer

你的问题在它的环境中听起来很奇怪。您使用的 time_t 占用了 32 位。您需要更多信息，这意味着更多位。所以你不得不使用比 int32 更大的东西。表现如何并不重要，对吧？选择将在仅使用 40 位或继续使用 int64 之间进行。除非必须存储数百万个实例，否则后者是一个明智的选择。

正如其他人指出的那样，了解真实性能的唯一方法是使用分析器来测量它，（在一些总样本中，一个简单的时钟就可以了）。所以请继续测量。将您的 time_t 使用全局替换为 typedef 并将其重新定义为 64 位并修补少数预期 real time_t 的实例一定不难。

除非您当前的 time_t 实例至少占用几兆内存，否则我的赌注将是“无法衡量的差异”。在当前的类似 Intel 的平台上，内核大部分时间都在等待外部存储器进入缓存。单个高速缓存未命中会停滞数百个周期。是什么让计算指令上的 1-tick 差异变得不可行。您的实际性能可能会下降，因为您当前的结构只适合一个缓存行，而较大的一个需要两个。如果你从未测量过你当前的性能，你可能会发现你可以通过在结构中添加一些成员的对齐或交换顺序来获得一些函数的极大加速。或者打包（1）结构而不是使用默认布局......

score 0 · Accepted Answer

加法/减法基本上每个变成两个周期，乘法和除法取决于实际的CPU。总体性能影响将相当低。

请注意，英特尔酷睿 2 支持 EM64T。