c++ - 逐元素向量乘法

Question

什么是更快（就代码所需的时间而言）替代以下用于乘以两个n元素整数向量的方法的方法：

{
// code for obtaining two n element int vectors, a and b
}
int temp = 0;  // a temporary variable
for (int ii = 0; ii < n; ++ii)
    temp += a[ii]*b[ii];

编辑：收到了几个好主意。我必须检查每一个，看看哪一个是最好的。当然，每个回答都告诉我一些新的东西。

score 6 · Accepted Answer

我能看到在 C++ 中加快这一速度的唯一方法是修复“循环携带依赖”，这意味着每次迭代都必须等到前一个临时值可用于求和。这可以通过展开和使用几个累积变量来完成：

int t0=0, t1=0, t2=0, t3=0;
for (int ii = 0; ii < n; ii += 4) {
    t0 += a[ii]*b[ii];
    t1 += a[ii+1]*b[ii+1];
    t2 += a[ii+2]*b[ii+2];
    t3 += a[ii+3]*b[ii+3];
}
int temp = t0 + t1 + t2 + t3;

现代处理器每个周期可以执行多个操作，但前提是不存在依赖关系。在我的系统上，这会产生大约 20% 的改进。注意：n 必须是 4 的倍数，否则您需要添加一个循环“epilog”来完成剩余的元素。测试和测量！我不知道 4 是否是“正确”的展开量。

您可以通过调用特定于处理器的 SIMD 内在函数来获得更多改进，但这不是标准的 C++。

score 2 · Accepted Answer

Just use the C++ Standard Library:

#include<algorithm>
#include<iostream>
#include<vector>

int main() {
  std::vector<double> x = {1, 2, 3};
  std::vector<double> y = {4, 5, 6};
  double xy = std::inner_product(x.begin(), x.end(), y.begin(), 0);
  std::cout<<"inner product: "<<xy<<std::endl;

  return 0;

}

Edit: added timing information.

Just out of curiosity I added the following timing information.

The source code:

#include<algorithm>
#include<iostream>
#include<vector>
#include<random>
#include<boost/timer/timer.hpp>

int main(int argc, char* argv[]) {
  // get the desired number of elements
  if(argc!=2) {
    std::cerr<<"usage: "<<argv[0]<<" N"<<std::endl;
    return EXIT_FAILURE;
  }

  int N = std::stoi(argv[1]);

  // set-up the random number generator
  std::random_device rd;
  std::mt19937 gen(rd());
  std::uniform_real_distribution<> dis(-100, 100);

  // prepare the vectors
  std::vector<double> x, y;

  // fill the vectors with random numbers
  auto rgen = [&dis, &gen]() { return dis(gen); };  
  std::generate_n(std::back_inserter(x), N, rgen);
  std::generate_n(std::back_inserter(y), N, rgen);

  // Heat-up the cache (try commenting-out this line and you'll see
  // that the time increases for whatever algorithm you put firts)
  double xy = std::inner_product(x.begin(), x.end(), y.begin(), 0.0);
  std::cout<<"heated-up value: "<<xy<<std::endl;

  { // start of new timing scope
    // write a message to the assembly source
    boost::timer::auto_cpu_timer t;
    asm("##### START OF ALGORITHMIC APPROACH #####");
    double xy = std::inner_product(x.begin(), x.end(), y.begin(), 0.0);
    asm("##### END OF ALGORITHMIC APPROACH #####");
    std::cout<<"algorithmic value: "<<xy<<std::endl<<"timing info: ";
  } // end of timing scope


  { // start of new timing scope
    // write a message to the assembly source
    boost::timer::auto_cpu_timer t;
    asm("##### START OF HAND-CODED APPROACH #####");
    double tmp = 0.0;
    for(int k=0; k<N; k++) {
      tmp += x[k] * y[k];
    }
    asm("##### END OF HAND-CODED APPROACH #####");
    std::cout<<"hand-coded value: "<<tmp<<std::endl<<"timing info: ";
  } // end of timing scope


  return EXIT_SUCCESS;  
}

The environment: 2.7 GHz Intel Core i7; OS X 10.7.4; gcc 4.8.1;

The compilation command: g++ -O3 inner-product-assembly.cpp -std=c++11 -lboost_timer -lboost_system

Sample runs:

[11:01:58 ~/research/c++] ./a.out 10
heated-up value: 8568.75
algorithmic value: 8568.75
timing info:  0.000006s wall, 0.000000s user + 0.000000s system = 0.000000s CPU (n/a%)
hand-coded value: 8568.75
timing info:  0.000004s wall, 0.000000s user + 0.000000s system = 0.000000s CPU (n/a%)

[11:01:59 ~/research/c++] ./a.out 100
heated-up value: -13072.2
algorithmic value: -13072.2
timing info:  0.000006s wall, 0.000000s user + 0.000000s system = 0.000000s CPU (n/a%)
hand-coded value: -13072.2
timing info:  0.000004s wall, 0.000000s user + 0.000000s system = 0.000000s CPU (n/a%)

[11:02:03 ~/research/c++] ./a.out 1000
heated-up value: 80389.1
algorithmic value: 80389.1
timing info:  0.000010s wall, 0.000000s user + 0.000000s system = 0.000000s CPU (n/a%)
hand-coded value: 80389.1
timing info:  0.000007s wall, 0.000000s user + 0.000000s system = 0.000000s CPU (n/a%)

[11:02:04 ~/research/c++] ./a.out 10000
heated-up value: 89753.7
algorithmic value: 89753.7
timing info:  0.000041s wall, 0.000000s user + 0.000000s system = 0.000000s CPU (n/a%)
hand-coded value: 89753.7
timing info:  0.000039s wall, 0.000000s user + 0.000000s system = 0.000000s CPU (n/a%)

[11:02:05 ~/research/c++] ./a.out 100000
heated-up value: -461750
algorithmic value: -461750
timing info:  0.000292s wall, 0.000000s user + 0.000000s system = 0.000000s CPU (n/a%)
hand-coded value: -461750
timing info:  0.000282s wall, 0.000000s user + 0.000000s system = 0.000000s CPU (n/a%)

[11:02:07 ~/research/c++] ./a.out 1000000
heated-up value: 2.52643e+06
algorithmic value: 2.52643e+06
timing info:  0.002702s wall, 0.000000s user + 0.000000s system = 0.000000s CPU (n/a%)
hand-coded value: 2.52643e+06
timing info:  0.002660s wall, 0.000000s user + 0.000000s system = 0.000000s CPU (n/a%)

[11:02:09 ~/research/c++] ./a.out 10000000
heated-up value: 6.04128e+06
algorithmic value: 6.04128e+06
timing info:  0.026557s wall, 0.030000s user + 0.000000s system = 0.030000s CPU (113.0%)
hand-coded value: 6.04128e+06
timing info:  0.026335s wall, 0.030000s user + 0.000000s system = 0.030000s CPU (113.9%)

[11:02:11 ~/research/c++] ./a.out 100000000
heated-up value: 2.27043e+07
algorithmic value: 2.27043e+07
timing info:  0.264547s wall, 0.270000s user + 0.000000s system = 0.270000s CPU (102.1%)
hand-coded value: 2.27043e+07
timing info:  0.264346s wall, 0.260000s user + 0.000000s system = 0.260000s CPU (98.4%)

So, the speed advantage of using the hand-coded approach seems to matter only if you use smaller arrays. Past the 10,000 mark, I would consider their run time identical, but I prefer the algorithmic approach because it is simpler to write and maintain, and it may benefit from updates to the library.

As usual, this timing info should be taken with a grain of salt.

score 1 · Accepted Answer

#include <valarray>

int main() {
  std::valarray<int> a {1, 2, 3, 4, 5};
  std::valarray<int> b {3, 4, 5, 6, 7};

  auto c = (a * b).sum();
}

valarray can use SIMD and other vector instructions, although it seems to be regularly neglected in common implementations.

score 0 · Accepted Answer

Quick answer: That's probably the fastest solution. Long Answer: there are two definitions of fast. If you're looking a quick and uncomplicated solution the above code should suit well. If you're looking for a fast running code, I don't know if reimplementing the for-loop with a while-loop or using libraries/packages will help. Good Luck!

c++ - 逐元素向量乘法

4 回答 4

Edit: added timing information.

Related

Reference