c++ - 从 for 循环优化函数调用

Question

我有一些简单的功能

int f_0(int);
int f_1(int);
...
int f_n(int);

然后我有一些 for 循环，我在其中调用 f_i()，这个循环中的条件不必相同

for (int i = 0; i < n; i++) {
   ...
   if (condition) {
      int myInt = f_i(); // this is not real implementation but shows the result
                         // I want to achieve
      ... //edit
   }
...
}

以下是我尝试实现的方法：

分解 for 循环并在相应部分调用每个函数。这会产生最快的代码，但这非常不雅，而且这样的代码很难进一步开发。
指向函数的指针

typedef int (*Foo) (int);

Foo fptr[] = { f_0, f_1, ... , f_n };

这是一种优雅的方法，但在我的情况下，它比打破循环慢 4.4。指向函数的常量指针产生类似的结果。

将我的功能封装成开关功能。这比打破循环慢 2.6。

有没有更好的方法来实现这一点？理想的解决方案是代码紧凑的解决方案，但编译器会分解循环并让计算速度最快。

我正在使用 MSVC 2012 并在发布模式下运行，优化设置为最大限度地提高速度。

编辑：

这是我的测试代码：

头文件

namespace c {
const int w = 1024;
const int A = w * w;
}

inline int f_0(int pos)  { return (pos - c::w + c::A) % c::A;           }
inline int f_1(int pos)  { return (pos + 1 - c::w + c::A) % c::A;       }
inline int f_2(int pos)  { return (pos + 1) % c::A;                     }
inline int f_3(int pos)  { return (pos + c::w) % c::A;                  }
inline int f_4(int pos)  { return (pos - 1 + c::w) % c::A;              }
inline int f_5(int pos)  { return (pos - 1 + c::A) % c::A;              }

typedef int (*NEIGH_F) (int);
typedef int (* const CNEIGH_F) (int);

const NEIGH_F  fptr[]  = { f_0, f_1, f_2, f_3, f_4, f_5 };
const CNEIGH_F cfptr[] = { f_0, f_1, f_2, f_3, f_4, f_5 };

inline int fswitch(int i, int pos) {
    switch(i) {
    case 0 : return f_0(pos); break;
    case 1 : return f_1(pos); break;
    case 2 : return f_2(pos); break;
    case 3 : return f_3(pos); break;
    case 4 : return f_4(pos); break;
    case 5 : return f_5(pos); break;
    default : return -1; break;
    }
}

主文件

#include "head.h"
#include <iostream>
#include <time.h>

int main()
{
    int maxRepeat = 100;

    clock_t startTime = clock();
    double sum = 0;
    for (int repeat = 0; repeat < maxRepeat; repeat++)
        for (int i = 0; i < c::A; i++) {
            sum += f_0(i);
            sum += f_1(i);
            sum += f_2(i);
            sum += f_3(i);
            sum += f_4(i);
            sum += f_5(i);
        }
    std::cout << "normal time:        " << (clock() - startTime)/(double)CLOCKS_PER_SEC
                 << "  sum is: " << sum << std::endl;

    startTime = clock();
    sum = 0;
    for (int repeat = 0; repeat < maxRepeat; repeat++)
        for (int i = 0; i < c::A; i++) {
            for (int j = 0; j < 6; j++)
                sum += fptr[j](i);
        }
    std::cout << "pointer time:       " << (clock() - startTime)/(double)CLOCKS_PER_SEC
                 << "  sum is: " << sum << std::endl;

    startTime = clock();
    sum = 0;
    for (int repeat = 0; repeat < maxRepeat; repeat++)
        for (int i = 0; i < c::A; i++) {
            for (int j = 0; j < 6; j++)
                sum += cfptr[j](i);
        }
    std::cout << "const pointer time: " << (clock() - startTime)/(double)CLOCKS_PER_SEC
                 << "  sum is: " << sum << std::endl;

    startTime = clock();
    sum = 0;
    for (int repeat = 0; repeat < maxRepeat; repeat++)
        for (int i = 0; i < c::A; i++) {
            for (int j = 0; j < 6; j++)
                sum += fswitch(j, i);
        }
    std::cout << "switch time:        " << (clock() - startTime)/(double)CLOCKS_PER_SEC
                 << "  sum is: " << sum << std::endl;
    std::cin.ignore();

    return 0;
}

函数 f_i 是我在实际实现中使用的函数，但是由于实际实现中的测试目的，这里的循环要简单得多，问题的第二个代码片段中显示了几种不同的形式循环。

编辑2：

我的循环形式应该保持不变我只是想找到如何将 f_i 放入我的循环的最佳方法。

score 4 · Accepted Answer

您可以使用模板函数而不是f_0, f_1... 更好地维护。

template <int N>
void f();

template <>
void f<0>()
{
    printf("f<0>");
}

template <>
void f<1>()
{
    printf("f<1>");
}

int main() {
    f<0>();
    f<1>();
    //f<2>(); // this is compile error
    return 0;
}

但是，模板参数必须作为编译时常量提供，所以你不能调用函数int i = 0; f<i>()

要解决此问题，您可以使用 switch-case 调用函数，不是很漂亮，但可以

void call_f(int i)
{
    switch(i)
    {
        case 0:
            f<0>();
            break;
        case 1:
            f<1>();
            break;
        default:
            // invalid i, report error
            break;
    }
}

但是，没有编译时检查i

放在一起

score 2 · Accepted Answer

我认为 Bryan Chen 的基于模板的解决方案很有意义。它会更容易维护和理解。我赞成该解决方案。

也就是说，如果您想要一个没有 switch 语句的更通用的解决方案，并且您想以“展开”的方式测试所有条件，您可以使用模板的编译时递归。

我使用了 3 个函数，基于接受单个整数参数的Condition函子。显然，您可以根据需要使条件更简单或更复杂。

其核心涉及递归的模板定义，以及用于停止递归的模板特化：

template <int N>
struct Condition;  // provides bool operator()(int arg)

template <int N>
void f();

template <int N>
void applyFunctions(int arg);

// Specialization placed first for clarity
template <>
void applyFunctions<0>(int arg)
{
  if (Condition<0>()(arg))
  {
    f<0>();
  }
  // End recursion
};

template <int N>
void applyFunctions(int arg)
{
  if (Condition<N>()(arg))
  {
    f<N>();
  }

  applyFunctions<N - 1>(arg);
};

这是一些输出。短语在条件检查[f<i>]中打印，而在函数调用中打印。为了清楚起见，我对齐了打印输出。

Loop
j = 0:                       Is even. [f<1>]       Always true. [f<0>]
j = 1:                                             Always true. [f<0>]
j = 2:  Is prime. [f<2>]     Is even. [f<1>]       Always true. [f<0>]
j = 3:  Is prime. [f<2>]                           Always true. [f<0>]
j = 4:                       Is even. [f<1>]       Always true. [f<0>]
j = 5:  Is prime. [f<2>]                           Always true. [f<0>]
j = 6:                       Is even. [f<1>]       Always true. [f<0>]
j = 7:  Is prime. [f<2>]                           Always true. [f<0>]
j = 8:                       Is even. [f<1>]       Always true. [f<0>]
j = 9:                                             Always true. [f<0>]
j = 10:                      Is even. [f<1>]       Always true. [f<0>]

完整的程序如下。如果您真的想做一些很酷的事情，您可以使Condition结构具有以某种constexpr方式计算的成员变量，以便在编译时确定包含的结果代码。如果这对您没有任何意义，您可能需要阅读模板、模板实例化和元编程。

#include <iostream>
#include <iomanip>

static int fw = 20;

template <int N>
struct Condition;

template <int N>
void f();


// Specialization 0
template <>
struct Condition<0>
{
  bool operator() (int arg)
  {
    std::cout << std::setw(fw) << " Always true. ";
    return true;
  }
};

template <>
void f<0>()
{
  std::cout << "[f<0>]";
}

// Specialization 1
template <>
struct Condition<1>
{
  bool operator() (int arg)
  {
    bool isEven = (arg % 2 == 0);
    if (isEven)
      std::cout << std::setw(fw) << " Is even. ";
    else 
      std::cout << std::setw(fw) << " ";
    return isEven;
  }
};

template <>
void f<1>()
{
  std::cout << "[f<1>]";
}


// Specialization 2
template <>
struct Condition<2>
{
  bool operator() (int arg)
  {
    bool isPrime = (arg == 2 || arg == 3 || arg == 5 || arg == 7);
    if (isPrime)
      std::cout << std::setw(fw) << " Is prime. ";
    else 
      std::cout << std::setw(fw) << " ";
    return isPrime;
  }
};

template <>
void f<2>()
{
  std::cout<< "[f<2>]";
}


template <int N>
void applyFunctions(int arg);

template <>
void applyFunctions<0>(int arg)
{
  if (Condition<0>()(arg))
  {
    f<0>();
  }
  // End recursion
};

template <int N>
void applyFunctions(int arg)
{
  if (Condition<N>()(arg))
  {
    f<N>();
  }

  applyFunctions<N - 1>(arg);
};


int main()
{
  applyFunctions<2>(4);

  std::cout << std::endl << "Loop" << std::endl;
  for (int j = 0; j < 11; ++j)
  {
    std::cout << "j = " << j << ": ";
    applyFunctions<2>(j);
    std::cout << std::endl;
  }
}

score 2 · Accepted Answer

以下两个调整从根本上改变了程序的结果输出（感谢干净的编译代码！）。这些表明性能优化在构建时与运行时不确定性之间有一个明确的权衡：如果你知道你将调用什么函数，或者你将在什么目标机器上运行，你可以编写更优化的代码。

通过指针调用函数使您可以灵活地在运行时调用函数，但代价是不内联函数调用。修改对以下的调用使指针时间等于正常时间。

normal time:        1.36  sum is: 3.29853e+14
pointer time:       1.36  sum is: 3.29853e+14
const pointer time: 1.35  sum is: 3.29853e+14
switch time:        1.14  sum is: 3.29853e+14

更改正在展开循环中的函数调用，因此：

   sum += fptr[1](i);
   sum += fptr[2](i);
   sum += fptr[3](i);
   sum += fptr[4](i);
   sum += fptr[5](i);

fswitch()对于您展示的情况，可能比正常情况要快，可能是因为内联fswitch()创建了一组被缓存的指令。也许具有必要专业知识的人可以通过反汇编生成的可执行文件来证明这一点。在我的测试中，我稍微扩大了 switch 函数（switch通过复制它们的双分支，如下所示），发现它的运行速度比正常慢大约 4 倍：

normal time:        2.35  sum is: 6.59706e+14
pointer time:       2.35  sum is: 6.59706e+14
const pointer time: 2.34  sum is: 6.59706e+14
switch time:        9.61  sum is: 6.59706e+14

变化是：

case 6 : return f_0(pos); break;
case 7 : return f_1(pos); break;
case 8 : return f_2(pos); break;
case 9 : return f_3(pos); break;
case 10 : return f_4(pos); break;
case 11 : return f_5(pos); break;

...

for (int j = 0; j < 12; j++)
    sum += fswitch(j, i);

...

const NEIGH_F  fptr[]  = { f_0, f_1, f_2, f_3, f_4, f_5, f_0, f_1, f_2, f_3, f_4, f_5 };
const CNEIGH_F cfptr[] = { f_0, f_1, f_2, f_3, f_4, f_5, f_0, f_1, f_2, f_3, f_4, f_5 };

...

for (int j = 0; j < 12; j++)
    sum += fptr[j](i);

...

等等

score 1 · Accepted Answer

f_i()函数和A常量w真的是给定的吗？因为如果是的话，这个问题不是可以简单地简化为表查找、加法和按位与吗？

/* Includes */
#include <stdio.h>
#include <time.h>


/* Constants */
const int w = 1024;
const int A = 1024*1024;
const int addconst[6] = {0xFFC00, 0xFFC01, 0x00001, 0x00400, 0x003FF, 0xFFFFF};
                      /*     A-w,   A-w+1,       1,       w,     w-1,     A-1 */

/* THE NOVELTY */
int ftable(int i, int pos){
    return (pos + addconst[i]) & 0xFFFFF;
}

/* Main */
int main(int argc, char* argv[]){
    clock_t timeTaken;
    int     repeat, maxRepeat = 100;
    int     i, j;
    long    sum = 0;

    timeTaken  = -clock();
    for(repeat=0;repeat<maxRepeat;repeat++)
        for(i=0;i<A;i++)
            for(j=0;j<6;j++)
                sum += ftable(j, i);
    timeTaken += clock();

    printf("Stop! Hammertime!        %f  sum is: %f\n",
           timeTaken/(double)CLOCKS_PER_SEC, (double)sum);
    return 0;
}

请注意，当sum变量为 a时long，所用时间为：

Stop! Hammertime!        0.348295  sum is: 329853173760000.000000

而当它是 a 时double，它需要两倍以上的时间：

Stop! Hammertime!        0.861563  sum is: 329853173760000.000000

我的编译标志是：

gcc -O3 -funroll-loops -finline-functions tmp.c -o tmp

如果您可以进一步解释函数索引如何依赖循环索引，我可以进行更多优化。

c++ - 从 for 循环优化函数调用

4 回答 4

Related

Reference