c++ - Boost::multi_array 性能问题

Question

我正在尝试使用以下测试程序将 boost::multi_array 的性能与本机动态分配的数组进行比较：

#include <windows.h>
#define _SCL_SECURE_NO_WARNINGS
#define BOOST_DISABLE_ASSERTS 
#include <boost/multi_array.hpp>

int main(int argc, char* argv[])
{
    const int X_SIZE = 200;
    const int Y_SIZE = 200;
    const int ITERATIONS = 500;
    unsigned int startTime = 0;
    unsigned int endTime = 0;

    // Create the boost array
    typedef boost::multi_array<double, 2> ImageArrayType;
    ImageArrayType boostMatrix(boost::extents[X_SIZE][Y_SIZE]);

    // Create the native array
    double *nativeMatrix = new double [X_SIZE * Y_SIZE];

    //------------------Measure boost----------------------------------------------
    startTime = ::GetTickCount();
    for (int i = 0; i < ITERATIONS; ++i)
    {
        for (int y = 0; y < Y_SIZE; ++y)
        {
            for (int x = 0; x < X_SIZE; ++x)
            {
                boostMatrix[x][y] = 2.345;
            }
        }
    }
    endTime = ::GetTickCount();
    printf("[Boost] Elapsed time: %6.3f seconds\n", (endTime - startTime) / 1000.0);

    //------------------Measure native-----------------------------------------------
    startTime = ::GetTickCount();
    for (int i = 0; i < ITERATIONS; ++i)
    {
        for (int y = 0; y < Y_SIZE; ++y)
        {
            for (int x = 0; x < X_SIZE; ++x)
            {
                nativeMatrix[x + (y * X_SIZE)] = 2.345;
            }
        }
    }
    endTime = ::GetTickCount();
    printf("[Native]Elapsed time: %6.3f seconds\n", (endTime - startTime) / 1000.0);

    return 0;
}

我得到以下结果：

[Boost] Elapsed time: 12.500 seconds
[Native]Elapsed time:  0.062 seconds

我不敢相信 multi_arrays 慢得多。谁能发现我做错了什么？

我认为缓存不是问题，因为我正在写入内存。

编辑：这是一个调试版本。根据 Laserallan 的建议，我做了一个发布版本：

[Boost] Elapsed time:  0.266 seconds
[Native]Elapsed time:  0.016 seconds

更近了。但是 16 比 1 对我来说似乎仍然很高。

好吧，没有明确的答案，但我将继续前进，暂时将我的真实代码保留为本机数组。

接受 Laserallan 的回答，因为这是我测试中最大的缺陷。

谢谢大家。

score 34 · Accepted Answer

在我的机器上使用

g++ -O3 -march=native -mtune=native --fast-math -DNDEBUG test.cpp -o test && ./test

我明白了

[Boost] Elapsed time:  0.020 seconds
[Native]Elapsed time:  0.020 seconds

但是改变const int ITERATIONS为5000我得到

[Boost] Elapsed time:  0.240 seconds
[Native]Elapsed time:  0.180 seconds

然后ITERATIONS回到500butX_SIZE和Y_SIZEset to400我得到一个更显着的差异

[Boost] Elapsed time:  0.460 seconds
[Native]Elapsed time:  0.070 seconds

最后为案例反转内部循环，[Boost]所以它看起来像

    for (int x = 0; x < X_SIZE; ++x)
    {
        for (int y = 0; y < Y_SIZE; ++y)
        {

和保持ITERATIONS,X_SIZE和Y_SIZE,我得到500400400

[Boost] Elapsed time:  0.060 seconds
[Native]Elapsed time:  0.080 seconds

[Native]如果我也为这种情况反转内部循环（所以这种情况的顺序错误），不出所料，我得到，

[Boost] Elapsed time:  0.070 seconds
[Native]Elapsed time:  0.450 seconds

我gcc (Ubuntu/Linaro 4.4.4-14ubuntu5) 4.4.5在 Ubuntu 10.10 上使用

所以总而言之：

通过适当的优化boost::multi_array 可以按预期完成工作
您访问数据的顺序很重要

score 11 · Accepted Answer

你的测试有缺陷。

在 DEBUG 构建中，boost::MultiArray 缺少它急需的优化通道。（比原生数组要多得多）
在 RELEASE 构建中，您的编译器将查找可以直接删除的代码，并且您的大部分代码都属于该类别。

您可能会看到优化编译器的结果，即可以删除大部分或全部“本机数组”循环。你的 boost::MultiArray 循环在理论上也是如此，但 MultiArray 可能足够复杂，足以打败你的优化器。

对你的测试平台做这个小改动，你会看到更真实的结果：= 2.345用“ *= 2.345”改变“”的出现，然后用优化再次编译。这将防止您的编译器发现每个测试的外循环是多余的。

我做到了，速度比较接近 2:1。

score 8 · Accepted Answer

您是在构建发布版还是调试版？

如果在调试模式下运行，boost 数组可能会非常慢，因为它们的模板魔法没有正确内联，从而在函数调用中产生大量开销。我不确定多数组是如何实现的，所以这可能完全关闭:)

也许存储顺序也存在一些差异，因此您可能将图像逐列存储并逐行写入。这会导致缓存行为不佳，并可能减慢速度。

尝试切换 X 和 Y 循环的顺序，看看是否有任何收获。这里有一些关于存储排序的信息：http: //www.boost.org/doc/libs/1_37_0/libs/multi_array/doc/user.html

编辑：由于您似乎正在使用二维数组进行图像处理，因此您可能有兴趣查看 boosts 图像处理库gil。

它可能具有开销较小的数组，非常适合您的情况。

score 4 · Accepted Answer

考虑改用 Blitz++。我试用了 Blitz，它的性能与 C 风格的数组相当！

使用下面添加的 Blitz 查看您的代码：

#include <windows.h>
#define _SCL_SECURE_NO_WARNINGS
#define BOOST_DISABLE_ASSERTS 
#include <boost/multi_array.hpp>
#include <blitz/array.h>

int main(int argc, char* argv[])
{
    const int X_SIZE = 200;
    const int Y_SIZE = 200;
    const int ITERATIONS = 500;
    unsigned int startTime = 0;
    unsigned int endTime = 0;

    // Create the boost array
    typedef boost::multi_array<double, 2> ImageArrayType;
    ImageArrayType boostMatrix(boost::extents[X_SIZE][Y_SIZE]);


    //------------------Measure boost----------------------------------------------
    startTime = ::GetTickCount();
    for (int i = 0; i < ITERATIONS; ++i)
    {
        for (int y = 0; y < Y_SIZE; ++y)
        {
            for (int x = 0; x < X_SIZE; ++x)
            {
                boostMatrix[x][y] = 2.345;
            }
        }
    }
    endTime = ::GetTickCount();
    printf("[Boost] Elapsed time: %6.3f seconds\n", (endTime - startTime) / 1000.0);

    //------------------Measure blitz-----------------------------------------------
    blitz::Array<double, 2> blitzArray( X_SIZE, Y_SIZE );
    startTime = ::GetTickCount();
    for (int i = 0; i < ITERATIONS; ++i)
    {
        for (int y = 0; y < Y_SIZE; ++y)
        {
            for (int x = 0; x < X_SIZE; ++x)
            {
                blitzArray(x,y) = 2.345;
            }
        }
    }
    endTime = ::GetTickCount();
    printf("[Blitz] Elapsed time: %6.3f seconds\n", (endTime - startTime) / 1000.0);


    //------------------Measure native-----------------------------------------------
    // Create the native array
    double *nativeMatrix = new double [X_SIZE * Y_SIZE];

    startTime = ::GetTickCount();
    for (int i = 0; i < ITERATIONS; ++i)
    {
        for (int y = 0; y < Y_SIZE; ++y)
        {
            for (int x = 0; x < X_SIZE; ++x)
            {
                nativeMatrix[x + (y * X_SIZE)] = 2.345;
            }
        }
    }
    endTime = ::GetTickCount();
    printf("[Native]Elapsed time: %6.3f seconds\n", (endTime - startTime) / 1000.0);



    return 0;
}

这是调试和发布的结果。

调试：

Boost  2.093 secs 
Blitz  0.375 secs 
Native 0.078 secs

释放：

Boost  0.266 secs
Blitz  0.016 secs
Native 0.015 secs

为此，我使用了 MSVC 2008 SP1 编译器。

我们现在可以和 C-stlye 数组说再见了吗？=p

score 4 · Accepted Answer

我想知道两件事：

1) 边界检查：在应用程序中包含 multi_array.hpp 之前定义 BOOST_DISABLE_ASSERTS 预处理器宏。这将关闭边界检查。不确定这是否在 NDEBUG 被禁用时被禁用。

2）基数索引：MultiArray可以从不为0的基数索引数组。这意味着multi_array存储一个基数（在每个维度中）并使用更复杂的公式来获取内存中的确切位置，我想知道这是否是关于那。

否则我不明白为什么多数组应该比 C 数组慢。

score 3 · Accepted Answer

我在看这个问题，因为我有同样的问题。我有一些想法要进行更严格的测试。

正如rodrigob指出的那样，循环顺序存在缺陷，因此您最初附加的代码中的任何结果都会产生误导性数据
此外，还有一些相当小的数组正在使用常量进行设置。编译器可能会优化循环，而实际上编译器不会知道数组的大小。数组的大小和迭代次数应该是运行时输入以防万一。

在 Mac 上，以下代码被配置为给出更有意义的答案。这里有4个测试。

#define BOOST_DISABLE_ASSERTS
#include "boost/multi_array.hpp"
#include <sys/time.h>
#include <stdint.h>
#include<string>

uint64_t GetTimeMs64()
{
  struct timeval tv;

  gettimeofday( &tv, NULL );

  uint64_t ret = tv.tv_usec;
  /* Convert from micro seconds (10^-6) to milliseconds (10^-3) */
  ret /= 1000;

  /* Adds the seconds (10^0) after converting them to milliseconds (10^-3) */
  ret += ( tv.tv_sec * 1000 );

  return ret;

}


void function1( const int X_SIZE, const int Y_SIZE, const int ITERATIONS )
{

  double nativeMatrix1add[X_SIZE*Y_SIZE];

  for( int x = 0 ; x < X_SIZE ; ++x )
  {
    for( int y = 0 ; y < Y_SIZE ; ++y )
    {
      nativeMatrix1add[y + ( x * Y_SIZE )] = rand();
    }
  }

  // Create the native array
  double* __restrict const nativeMatrix1p = new double[X_SIZE * Y_SIZE];
  uint64_t startTime = GetTimeMs64();
  for( int i = 0 ; i < ITERATIONS ; ++i )
  {
    for( int xy = 0 ; xy < X_SIZE*Y_SIZE ; ++xy )
    {
      nativeMatrix1p[xy] += nativeMatrix1add[xy];
    }
  }
  uint64_t endTime = GetTimeMs64();
  printf( "[Native Pointer]    Elapsed time: %6.3f seconds\n", ( endTime - startTime ) / 1000.0 );

}

void function2( const int X_SIZE, const int Y_SIZE, const int ITERATIONS )
{

  double nativeMatrix1add[X_SIZE*Y_SIZE];

  for( int x = 0 ; x < X_SIZE ; ++x )
  {
    for( int y = 0 ; y < Y_SIZE ; ++y )
    {
      nativeMatrix1add[y + ( x * Y_SIZE )] = rand();
    }
  }

  // Create the native array
  double* __restrict const nativeMatrix1 = new double[X_SIZE * Y_SIZE];
  uint64_t startTime = GetTimeMs64();
  for( int i = 0 ; i < ITERATIONS ; ++i )
  {
    for( int x = 0 ; x < X_SIZE ; ++x )
    {
      for( int y = 0 ; y < Y_SIZE ; ++y )
      {
        nativeMatrix1[y + ( x * Y_SIZE )] += nativeMatrix1add[y + ( x * Y_SIZE )];
      }
    }
  }
  uint64_t endTime = GetTimeMs64();
  printf( "[Native 1D Array]   Elapsed time: %6.3f seconds\n", ( endTime - startTime ) / 1000.0 );

}


void function3( const int X_SIZE, const int Y_SIZE, const int ITERATIONS )
{

  double nativeMatrix2add[X_SIZE][Y_SIZE];

  for( int x = 0 ; x < X_SIZE ; ++x )
  {
    for( int y = 0 ; y < Y_SIZE ; ++y )
    {
      nativeMatrix2add[x][y] = rand();
    }
  }

  // Create the native array
  double nativeMatrix2[X_SIZE][Y_SIZE];
  uint64_t startTime = GetTimeMs64();
  for( int i = 0 ; i < ITERATIONS ; ++i )
  {
    for( int x = 0 ; x < X_SIZE ; ++x )
    {
      for( int y = 0 ; y < Y_SIZE ; ++y )
      {
        nativeMatrix2[x][y] += nativeMatrix2add[x][y];
      }
    }
  }
  uint64_t endTime = GetTimeMs64();
  printf( "[Native 2D Array]   Elapsed time: %6.3f seconds\n", ( endTime - startTime ) / 1000.0 );

}



void function4( const int X_SIZE, const int Y_SIZE, const int ITERATIONS )
{

  boost::multi_array<double, 2> boostMatrix2add( boost::extents[X_SIZE][Y_SIZE] );

  for( int x = 0 ; x < X_SIZE ; ++x )
  {
    for( int y = 0 ; y < Y_SIZE ; ++y )
    {
      boostMatrix2add[x][y] = rand();
    }
  }

  // Create the native array
  boost::multi_array<double, 2> boostMatrix( boost::extents[X_SIZE][Y_SIZE] );
  uint64_t startTime = GetTimeMs64();
  for( int i = 0 ; i < ITERATIONS ; ++i )
  {
    for( int x = 0 ; x < X_SIZE ; ++x )
    {
      for( int y = 0 ; y < Y_SIZE ; ++y )
      {
        boostMatrix[x][y] += boostMatrix2add[x][y];
      }
    }
  }
  uint64_t endTime = GetTimeMs64();
  printf( "[Boost Array]       Elapsed time: %6.3f seconds\n", ( endTime - startTime ) / 1000.0 );

}

int main( int argc, char* argv[] )
{

  srand( time( NULL ) );

  const int X_SIZE = std::stoi( argv[1] );
  const int Y_SIZE = std::stoi( argv[2] );
  const int ITERATIONS = std::stoi( argv[3] );

  function1( X_SIZE, Y_SIZE, ITERATIONS );
  function2( X_SIZE, Y_SIZE, ITERATIONS );
  function3( X_SIZE, Y_SIZE, ITERATIONS );
  function4( X_SIZE, Y_SIZE, ITERATIONS );

  return 0;
}

一个只有一维数组，使用带有整数数学和双循环的 []
一个使用指针递增的具有相同一维数组的
多维 C 数组
提升多数组

所以从命令行运行，运行

./test_array xsize ysize iterations"

并且您可以很好地了解这些方法将如何执行。这是我使用以下编译器标志得到的结果：

g++4.9.2 -O3 -march=native -funroll-loops -mno-avx --fast-math -DNDEBUG  -c -std=c++11


./test_array 51200 1 20000
[Native 1-Loop ]    Elapsed time:  0.537 seconds
[Native 1D Array]   Elapsed time:  2.045 seconds
[Native 2D Array]   Elapsed time:  2.749 seconds
[Boost Array]       Elapsed time:  1.167 seconds

./test_array 25600 2 20000
[Native 1-Loop ]    Elapsed time:  0.531 seconds
[Native 1D Array]   Elapsed time:  1.241 seconds
[Native 2D Array]   Elapsed time:  1.631 seconds
[Boost Array]       Elapsed time:  0.954 seconds

./test_array 12800 4 20000
[Native 1-Loop ]    Elapsed time:  0.536 seconds
[Native 1D Array]   Elapsed time:  1.214 seconds
[Native 2D Array]   Elapsed time:  1.223 seconds
[Boost Array]       Elapsed time:  0.798 seconds

./test_array 6400 8 20000
[Native 1-Loop ]    Elapsed time:  0.540 seconds
[Native 1D Array]   Elapsed time:  0.845 seconds
[Native 2D Array]   Elapsed time:  0.878 seconds
[Boost Array]       Elapsed time:  0.803 seconds

./test_array 3200 16 20000
[Native 1-Loop ]    Elapsed time:  0.537 seconds
[Native 1D Array]   Elapsed time:  0.661 seconds
[Native 2D Array]   Elapsed time:  0.673 seconds
[Boost Array]       Elapsed time:  0.708 seconds

./test_array 1600 32 20000
[Native 1-Loop ]    Elapsed time:  0.532 seconds
[Native 1D Array]   Elapsed time:  0.592 seconds
[Native 2D Array]   Elapsed time:  0.596 seconds
[Boost Array]       Elapsed time:  0.764 seconds

./test_array 800 64 20000
[Native 1-Loop ]    Elapsed time:  0.546 seconds
[Native 1D Array]   Elapsed time:  0.594 seconds
[Native 2D Array]   Elapsed time:  0.606 seconds
[Boost Array]       Elapsed time:  0.764 seconds

./test_array 400 128 20000
[Native 1-Loop ]    Elapsed time:  0.536 seconds
[Native 1D Array]   Elapsed time:  0.560 seconds
[Native 2D Array]   Elapsed time:  0.564 seconds
[Boost Array]       Elapsed time:  0.746 seconds

所以，我认为可以肯定地说 boost multi_array 的表现相当不错。没有什么比单循环评估更好的了，但是根据数组的维度，boost::multi_array 可能会击败具有双循环的标准 c 数组。

score 2 · Accepted Answer

2

要尝试的另一件事是使用迭代器而不是 boost 数组的直接索引。

于 2009-01-15T14:24:49.277 回答

score 2 · Accepted Answer

我本来希望多阵列同样有效。但是我在使用 gcc 的 PPC Mac 上得到了类似的结果。我还尝试了 multiarrayref，这样两个版本都使用相同的存储，没有区别。很高兴知道这一点，因为我在一些代码中使用了多数组，并且只是假设它类似于手动编码。

score 1 · Accepted Answer

我想我知道问题是什么……也许吧。

为了使 boost 实现具有如下语法：matrix[x][y]。这意味着 matrix[x] 必须返回对对象的引用，该对象的作用类似于一维数组column，此时 reference[y] 为您提供元素。

这里的问题是您以行主要顺序进行迭代（这在 c/c++ 中很典型，因为本机数组是行主要 IIRC。在这种情况下，编译器必须为每个 y 重新执行 matrix[x]。如果您在使用 boost 矩阵时的列主顺序，您可能会看到更好的性能。

只是一个理论。

编辑：在我的 linux 系统上（有一些小的改动）我测试了我的理论，并且通过切换 x 和 y 确实显示了一些性能改进，但它仍然比本机数组慢。这可能是编译器无法优化临时引用类型的简单问题。

score 1 · Accepted Answer

在发布模式下构建，使用 objdump，然后查看程序集。他们可能正在做完全不同的事情，您将能够看到编译器正在使用哪些优化。

score 1 · Accepted Answer

在这里提出并回答了类似的问题：

http://www.codeguru.com/forum/archive/index.php/t-300014.html

简短的回答是编译器最容易优化简单数组，而不是优化 Boost 版本那么容易。因此，特定的编译器可能不会为 Boost 版本提供所有相同的优化优势。

编译器的优化程度和保守程度也会有所不同（例如，使用模板代码或其他复杂情况）。

score 1 · Accepted Answer

我在 Snow Leopard Mac OS 上使用gcc 4.2.1

Debug:
[Boost] Elapsed time:  2.268 seconds
[Native]Elapsed time:  0.076 seconds

Release:
[Boost] Elapsed time:  0.065 seconds
[Native]Elapsed time:  0.020 seconds

这是代码（修改后可以在 Unix 上编译）：

#define BOOST_DISABLE_ASSERTS
#include <boost/multi_array.hpp>
#include <ctime>

int main(int argc, char* argv[])
{
    const int X_SIZE = 200;
    const int Y_SIZE = 200;
    const int ITERATIONS = 500;
    unsigned int startTime = 0;
    unsigned int endTime = 0;

    // Create the boost array
    typedef boost::multi_array<double, 2> ImageArrayType;
    ImageArrayType boostMatrix(boost::extents[X_SIZE][Y_SIZE]);

    // Create the native array
    double *nativeMatrix = new double [X_SIZE * Y_SIZE];

    //------------------Measure boost----------------------------------------------
    startTime = clock();
    for (int i = 0; i < ITERATIONS; ++i)
    {
        for (int y = 0; y < Y_SIZE; ++y)
        {
            for (int x = 0; x < X_SIZE; ++x)
            {
                boostMatrix[x][y] = 2.345;
            }
        }
    }
    endTime = clock();
    printf("[Boost] Elapsed time: %6.3f seconds\n", (endTime - startTime) / (double)CLOCKS_PER_SEC);

    //------------------Measure native-----------------------------------------------
    startTime = clock();
    for (int i = 0; i < ITERATIONS; ++i)
    {
        for (int y = 0; y < Y_SIZE; ++y)
        {
            for (int x = 0; x < X_SIZE; ++x)
            {
                nativeMatrix[x + (y * X_SIZE)] = 2.345;
            }
        }
    }
    endTime = clock();
    printf("[Native]Elapsed time: %6.3f seconds\n", (endTime - startTime) / (double)CLOCKS_PER_SEC);

    return 0;
}

score 1 · Accepted Answer

查看 g++ 4.8.2 生成的程序集，-O3 -DBOOST_DISABLE_ASSERTS使用并使用operator()和[][]访问元素的方式，很明显，与原生数组和手动索引计算相比，唯一的额外操作是添加基数。不过我没有衡量这个成本。

score 0 · Accepted Answer

我在 Visual Studio 2008 v9.0.21022 中修改了上述代码，并应用了 C 和 C++ 的数值配方例程中的容器例程

http://www.nrbook.com/nr3/分别使用他们的许可例程 dmatrix 和 MatDoub

dmatrix 使用过时的语法 malloc 运算符，不推荐使用... MatDoub 使用 New 命令

以秒为单位的速度在发布版本中：

提升：0.437

原生：0.032

数字配方 C：0.031

数值配方 C++：0.031

所以从上面的闪电战看来是最好的免费替代品。

score 0 · Accepted Answer

我已经在 VC++ 2010 下编译了代码（稍作修改），并启用了优化（“最大化速度”以及内联“任何合适的”函数和“偏爱快速代码”），得到的时间为 0.015/0.391。我已经生成了装配清单，虽然我是一个糟糕的装配菜鸟，但在 boost-measuring 循环中有一行对我来说看起来不太好：

call    ??A?$multi_array_ref@N$01@boost@@QAE?AV?$sub_array@N$00@multi_array@detail@1@H@Z ; boost::multi_array_ref<double,2>::operator[]

[] 运算符之一没有内联！被调用的过程再次调用，这次是multi_array::value_accessor_n<...>::access<...>()：

call    ??$access@V?$sub_array@N$00@multi_array@detail@boost@@PAN@?$value_accessor_n@N$01@multi_array@detail@boost@@IBE?AV?$sub_array@N$00@123@U?$type@V?$sub_array@N$00@multi_array@detail@boost@@@3@HPANPBIPBH3@Z ; boost::detail::multi_array::value_accessor_n<double,2>::access<boost::detail::multi_array::sub_array<double,1>,double *>

总之，这两个过程是相当多的代码，用于简单地访问数组中的单个元素。我的总体印象是该库是如此复杂和高级，以至于 Visual Studio 无法按照我们的意愿对其进行优化（使用 gcc 的海报显然得到了更好的结果）。

恕我直言，一个好的编译器真的应该对这两个过程进行内联和优化——两者都非常简短和直接，不包含任何循环等。很多时间可能只是浪费在传递它们的参数和结果上。

score 0 · Accepted Answer

正如 rodrigob 所回答的，激活适当的优化（GCC 的默认值为 -O0）是获得良好性能的关键。此外，我还使用Blaze DynamicMatrix进行了测试，它使用完全相同的优化标志产生了额外的 2 倍性能提升。https://bitbucket.org/account/user/blaze-lib/projects/BLAZE

c++ - Boost::multi_array 性能问题

16 回答 16

Related

Reference