c++ - Fastest factorial implementation with a 64-bit result in assembly

Question

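This is not homework, just something I thought of. Calculating a factorial directly is not exactly fast; memoization can help, but if the result has to fit into 32 or 64 bits, the factorial only works for inputs 0 through 12 and 0 through 20 respectively. So... we might as well use a lookup table: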

n   n!
0   1       
1   1       
2   2       
3   6       
4   24      
5   120     
6   720     
7   5040        
8   40320       
9   362880      
10  3628800     
11  39916800        
12  479001600       
13  6227020800  2^32=   4294967296
14  87178291200     
15  1.30767E+12     
16  2.09228E+13     
17  3.55687E+14     
18  6.40237E+15     
19  1.21645E+17     
20  2.4329E+18      
        2^64=   1.84467E+19
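
The cut-off points in the table (12 for 32 bits, 20 for 64 bits) can be checked mechanically. The following is a minimal sketch, not part of the original question; the function name largest_factorial_argument is made up for illustration. It relies on the fact that (n+1)! fits exactly when n! <= max / (n+1).

```cpp
#include <cstdint>
#include <limits>

// Largest n such that n! still fits in the unsigned type UInt.
// (Hypothetical helper, named here for illustration only.)
template <typename UInt>
constexpr unsigned largest_factorial_argument() {
    UInt value = 1;   // holds n! for the current n
    unsigned n = 0;
    // (n+1)! = n! * (n+1) fits exactly when n! <= max / (n+1)
    while (value <= std::numeric_limits<UInt>::max() / (n + 1)) {
        ++n;
        value *= n;
    }
    return n;
}

static_assert(largest_factorial_argument<std::uint32_t>() == 12, "12! is the last 32-bit factorial");
static_assert(largest_factorial_argument<std::uint64_t>() == 20, "20! is the last 64-bit factorial");
```

Because everything is constexpr, the check costs nothing at run time; it only documents why the tables have 13 and 21 entries.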

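So, suppose I want an inline C++ factorial function that uses inline assembly, with a 32-bit or 64-bit unsigned integer as the result. If the input is negative or large enough to cause overflow, the output should be 0. How can this be done in assembly so that it consumes the fewest cycles? This code will run on a 64-bit Intel/AMD architecture. If feasible, I am interested in improving the worst case, so 20! should not take much longer to compute than 0! - hopefully there is a binary search approach. Hopefully there is also a clever trick for doing if (n == 0 || n == 1) { return 1; }. Also, if the output needs to be 32-bit, then I think the assembly instructions can contain both code and data. My assembly knowledge is weak. Please let me know if the question does not make much sense.

Being able to use the function from C++ would be nice - it makes this a more realistic problem. If, for instance, calling a function is expensive, then trying to save 1-2 clock cycles in the body of the assembly will not help much.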

score 12 · Accepted Answer

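I have cleverly built a lookup table in assembly. In case your assembly is rusty: the routine expects the parameter in the ecx register. I verify that it is in range, then read the lookup table's value into the eax and edx registers. If the value is out of range, I simply xor the eax and edx registers with themselves (this forces them to 0). Unfortunately, since it is an assembly routine, the compiler will be unable to inline the code. But I'm sure the few cycles I saved by writing my awesome assembly routine will more than make up for any gain from inlining.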

factorial:
    xorl    %eax, %eax
    xorl    %edx, %edx
    cmpl    $20, %ecx
    ja  .TOOBIG
    movl    CSWTCH.1(,%ecx,8), %eax
    movl    CSWTCH.1+4(,%ecx,8), %edx
.TOOBIG:
    ret

LOOKUP_TABLE:
    .section    .rodata
    .align 32
    .type   CSWTCH.1, @object
    .size   CSWTCH.1, 168
CSWTCH.1:
    .long   1
    .long   0
    .long   1
    .long   0
    .long   2
    .long   0
    .long   6
    .long   0
    .long   24
    .long   0
    .long   120
    .long   0
    .long   720
    .long   0
    .long   5040
    .long   0
    .long   40320
    .long   0
    .long   362880
    .long   0
    .long   3628800
    .long   0
    .long   39916800
    .long   0
    .long   479001600
    .long   0
    .long   1932053504
    .long   1
    .long   1278945280
    .long   20
    .long   2004310016
    .long   304
    .long   2004189184
    .long   4871
    .long   -288522240
    .long   82814
    .long   -898433024
    .long   1490668
    .long   109641728
    .long   28322707
    .long   -2102132736
    .long   566454140

The lookup table is hard to maintain, so I've included the script I used to build it:

#include <cstdint>

static constexpr uint64_t const_factorial(uint32_t i) {
    return (i==0)? 1: (i * const_factorial(i-1));
}

uint64_t factorial(uint32_t i) {
    switch(i) {
        case 0: return const_factorial(0);
        case 1: return const_factorial(1);
        case 2: return const_factorial(2);
        case 3: return const_factorial(3);
        case 4: return const_factorial(4);
        case 5: return const_factorial(5);
        case 6: return const_factorial(6);
        case 7: return const_factorial(7);
        case 8: return const_factorial(8);
        case 9: return const_factorial(9);
        case 10: return const_factorial(10);
        case 11: return const_factorial(11);
        case 12: return const_factorial(12);
        case 13: return const_factorial(13);
        case 14: return const_factorial(14);
        case 15: return const_factorial(15);
        case 16: return const_factorial(16);
        case 17: return const_factorial(17);
        case 18: return const_factorial(18);
        case 19: return const_factorial(19);
        case 20: return const_factorial(20);
        default: return 0;
    }
}

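In case you missed it in my poor attempt at humor: the C++ compiler is more than capable of optimizing your code correctly. As you can see, I did not need to do anything fancy with lookup tables, binary search trees, or hashes. Just a simple switch statement, and the compiler did the rest.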
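
To read the compiler-generated table in this answer: each 64-bit factorial is stored as two .long entries, low 32 bits first, and the listing prints them as signed values, which is why some of them look negative. Here is a minimal sketch of that split; the helper names factorial_u64, low_half and high_half are illustrative and do not come from the answer.

```cpp
#include <cstdint>

// Product 1*2*...*n as a 64-bit value; only meaningful for n <= 20.
constexpr std::uint64_t factorial_u64(unsigned n) {
    return n == 0 ? 1 : n * factorial_u64(n - 1);
}

// Low 32 bits, reinterpreted as signed the way the assembler listing prints them.
// The unsigned-to-signed conversion is modular on mainstream compilers
// and guaranteed to be so from C++20 on.
constexpr std::int32_t low_half(std::uint64_t v) {
    return static_cast<std::int32_t>(static_cast<std::uint32_t>(v));
}

// High 32 bits, i.e. the second .long of each pair.
constexpr std::uint32_t high_half(std::uint64_t v) {
    return static_cast<std::uint32_t>(v >> 32);
}
```

For example, 13! = 6227020800 = 1 * 2^32 + 1932053504, which matches the pair ".long 1932053504" / ".long 1" in the table.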

score 5 · Accepted Answer

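It's been a while since I flexed my assembly muscles, so I'll just offer some general advice.

Since you know ahead of time exactly how many items there are and how large each one is, just make a contiguous array of values (either hard-coded or pre-calculated). After validating the function's input (< 0 or > 12/20), you can use simple offset addressing to retrieve the appropriate value. This works in O(1) time.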
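
A minimal C++ sketch of this advice for the 32-bit case might look as follows. The names kFactorials32 and factorial32 are assumptions made for illustration; the answer itself gives no code.

```cpp
#include <cstdint>

// 0! through 12!: every factorial that fits in an unsigned 32-bit integer.
constexpr std::uint32_t kFactorials32[] = {
    1, 1, 2, 6, 24, 120, 720, 5040, 40320, 362880, 3628800, 39916800, 479001600};

// Returns n!, or 0 when n is negative or the result would not fit in 32 bits.
inline std::uint32_t factorial32(int n) {
    // Casting to unsigned maps negative inputs to very large values,
    // so a single comparison rejects both ends of the invalid range.
    const unsigned index = static_cast<unsigned>(n);
    return index < sizeof(kFactorials32) / sizeof(kFactorials32[0]) ? kFactorials32[index] : 0;
}
```

Since the function is a range check plus one indexed load, 12! costs the same as 0!, and a compiler can inline it at the call site.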

score 1 · Accepted Answer

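Update from 2021, with C++17 at hand.

I guess there is no faster method than the one below. No assembler is needed.

Because so few factorials fit into an unsigned 64-bit value (21 of them), a compile-time constexpr array will use just 21*8 = 168 bytes.

168 bytes

This number is so low that we can easily build a compile-time constexpr std::array and stop all further considerations.

Really everything can be done at compile time.

We first define the standard approach for calculating a factorial as a constexpr function: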

constexpr unsigned long long factorial(unsigned long long n) noexcept {
    return n == 0ull ? 1 : n * factorial(n - 1ull);
}

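With that, factorials can easily be calculated at compile time. Then we fill a std::array with all factorials. We again use a constexpr function and make it a template with a variadic parameter pack.

We use std::integer_sequence to create the factorials for the indices 0,1,2,3,4,5, ...

That is straightforward and not complicated: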

template <size_t... ManyIndices>
constexpr auto generateArrayHelper(std::integer_sequence<size_t, ManyIndices...>) noexcept {
    return std::array<unsigned long long, sizeof...(ManyIndices)>{ { factorial(ManyIndices)... } };
}

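This function is fed an integer sequence 0,1,2,3,4,... and returns a std::array<unsigned long long, ...> with the corresponding factorials.

We know that we can store at most 21 values. We therefore write a second function that calls the one above with the integer sequence 0,1,2,3,...,19,20 (which is what std::make_integer_sequence<size_t, 21> produces), like so: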

constexpr auto generateArray()noexcept {
    return generateArrayHelper(std::make_integer_sequence<size_t, MaxIndexFor64BitValue>());
}

And now, finally,

constexpr auto Factorial = generateArray();

will give us a compile-time std::array<unsigned long long, 21> named Factorial that contains all the factorials. If we need the i-th factorial, we can simply write Factorial[i]. No calculation happens at runtime.
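
One thing the array alone does not give is the "return 0 on overflow" behaviour the question asks for, because indexing a std::array out of range with operator[] is undefined. A minimal sketch of a bounds-checked wrapper follows. The names kCount, makeTable, FactorialTable and factorialOrZero are hypothetical, and the table is built with a constexpr loop rather than an index sequence purely to keep the example short; this is an alternative construction, not the answer's method.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kCount = 21;  // entries for 0! through 20!

// Builds the table with a constexpr loop (needs C++17 for constexpr std::array::operator[]).
constexpr std::array<unsigned long long, kCount> makeTable() noexcept {
    std::array<unsigned long long, kCount> table{};
    table[0] = 1;
    for (std::size_t i = 1; i < kCount; ++i)
        table[i] = table[i - 1] * i;
    return table;
}

constexpr auto FactorialTable = makeTable();

static_assert(FactorialTable[20] == 2432902008176640000ULL, "20! must match the known value");

// Hypothetical wrapper: yields 0 for arguments whose factorial does not fit in 64 bits.
constexpr unsigned long long factorialOrZero(std::size_t n) noexcept {
    return n < FactorialTable.size() ? FactorialTable[n] : 0ULL;
}
```

At run time, factorialOrZero is a single comparison plus one load, and for a constant argument the whole call folds away.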

I do not think there is a faster way to calculate a factorial.

Please see the complete program below:

#include <iostream>
#include <array>
#include <utility>
// ----------------------------------------------------------------------
// All the below will be calculated at compile time
// constexpr factorial function
constexpr unsigned long long factorial(unsigned long long n) noexcept {
    return n == 0ull ? 1 : n * factorial(n - 1ull);
}
// We will automatically build an array of factorials at compile time
// Generate a std::array with n elements 
template <size_t... ManyIndices>
constexpr auto generateArrayHelper(std::integer_sequence<size_t, ManyIndices...>) noexcept {
    return std::array<unsigned long long, sizeof...(ManyIndices)>{ { factorial(ManyIndices)... } };
}
// Max index for factorials for an 64bit unsigned value 
constexpr size_t MaxIndexFor64BitValue = 21;

// Generate the required number of elements
constexpr auto generateArray()noexcept {
    return generateArrayHelper(std::make_integer_sequence<size_t, MaxIndexFor64BitValue>());
}
// This is an constexpr array of all factorials numbers
constexpr auto Factorial = generateArray();

// All the above was compile time
// ----------------------------------------------------------------------

// Test function
int main() {
    for (size_t i{}; i < MaxIndexFor64BitValue; ++i)
        std::cout << i << '\t' << Factorial[i] << '\n';
    return 0;
}

Developed, compiled and tested with Microsoft Visual Studio Community 2019, version 16.8.2

Additionally compiled and tested with gcc 10.2 and clang 11.0.1

Language: C++17

score 0 · Accepted Answer

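If you are only working with numbers from 0-19, a hash table or binary tree is overkill. Just create a 20-element array and query the index. Note that a 32-bit unsigned int only holds factorials up to 12!, so the elements need a 64-bit type to cover 13! through 19!:

#include <cstdint>
#include <stdexcept>

const std::uint64_t FACTORIALS[20] = {1,1,2,6,24,120 /* etc.. */};

std::uint64_t factorial(unsigned int num) {
    // num is unsigned, so only the upper bound needs checking
    if(num <= 19) {
        return FACTORIALS[num];
    }
    else {
        throw std::out_of_range("factorial: argument out of range");
    }
}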

You could also use templates to build the array.

score 0 · Accepted Answer

gcc's answer

...which probably beats yours, compiled from:

uint64_t answers[] = {
    1ULL,
    1ULL,
    2ULL,
    6ULL,
    24ULL,
    ...
    2432902008176640000ULL,
};

uint64_t factorial(unsigned int i) {
    if(i >= sizeof(answers) / sizeof(*answers))
        return 0;
    else
        return answers[i];
}

...and the assembly...

factorial:
    cmpl    $20, %edi
    movl    $0, %eax
    ja  .L3
    movslq  %edi,%eax
    movq    answers(,%rax,8), %rax
.L3:
    rep
    ret
answers:
    .quad 1
    .quad 1
    ...

...which seems to be the first 64-bit assembler answer...

score 0 · Accepted Answer

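Who says your assembly version is going to be faster than the C++ version anyway? In fact, who says it will even match it in speed? I'd bet $100 you won't even manage to make it as fast as the compiler does.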

score 0 · Accepted Answer

By popular demand: performance-wise this comes out as a binary search, not a hash table (which I believe standard C++ does not have).

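#include <cstdint>
#include <map>

// Placeholder for BigIntThing; any integer-like value type works here
using BigIntThing = std::uint64_t;

int main()
{
    std::map<int, BigIntThing> factMap;
    // insert all elements here, probably fancier ways to do this
    factMap.insert({0, 1});
    factMap.insert({1, 1});
    factMap.insert({2, 2});
    // ....
    // to access, say 15! (present only once the elided inserts above are filled in)
    auto it = factMap.find(15);
    if (it != factMap.end()) {
        BigIntThing value = it->second;
        (void)value; // use the value here
    }
}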

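That's it. A std::map is ordered by its key, so with int keys the lookup works out of the box (a comparison operator would only be required if BigIntThing were the key type). There should be a way to make this const and/or static and/or global so that it compiles the way you want.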
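
One possible way to get the const and static variant is a function-local static map that is filled once on first use. This is only a sketch under assumptions: BigIntThing is stood in for by a 64-bit integer, and the name mapFactorial is made up for illustration.

```cpp
#include <cstdint>
#include <map>

// Hypothetical stand-in for the answer's BigIntThing.
using BigIntThing = std::uint64_t;

// Returns n! from a map built once on first call, or 0 if n is not a key.
inline BigIntThing mapFactorial(int n) {
    static const std::map<int, BigIntThing> factMap = [] {
        std::map<int, BigIntThing> m;
        BigIntThing value = 1;
        m.emplace(0, value);
        for (int i = 1; i <= 20; ++i) {
            value *= static_cast<BigIntThing>(i);
            m.emplace(i, value);
        }
        return m;
    }();
    const auto it = factMap.find(n);  // O(log n) tree search over the int keys
    return it != factMap.end() ? it->second : 0;
}
```

Compared with a plain array this adds a tree traversal per lookup, so it mainly illustrates the answer's idea rather than competing on speed.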
