c++ - 优化位数组访问

Question

我正在使用 Dipperstein 的 bitarray.cpp 类来处理双层（黑白）图像，其中图像数据本身存储为一个像素一位。

我需要遍历每一位，每张图像大约 4--9 兆像素，超过数百张图像，使用 for 循环，例如：

for( int i = 0; i < imgLength; i++) {
    if( myBitArray[i] == 1 ) {
         //  ... do stuff ...
    }
}

性能是可用的，但并不惊人。我通过 gprof 运行程序，发现有大量时间和数百万次调用std::vector迭代器和开始等方法。这是顶部采样的函数：

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 37.91      0.80     0.80        2     0.40     1.01  findPattern(bit_array_c*, bool*, int, int, int)
 12.32      1.06     0.26 98375762     0.00     0.00  __gnu_cxx::__normal_iterator<unsigned char const*, std::vector<unsigned char, std::allocator<unsigned char> > >::__normal_iterator(unsigned char const* const&)
 11.85      1.31     0.25 48183659     0.00     0.00  __gnu_cxx::__normal_iterator<unsigned char const*, std::vector<unsigned char, std::allocator<unsigned char> > >::operator+(int const&) const
 11.37      1.55     0.24 49187881     0.00     0.00  std::vector<unsigned char, std::allocator<unsigned char> >::begin() const
  9.24      1.75     0.20 48183659     0.00     0.00  bit_array_c::operator[](unsigned int) const
  8.06      1.92     0.17 48183659     0.00     0.00  std::vector<unsigned char, std::allocator<unsigned char> >::operator[](unsigned int) const
  5.21      2.02     0.11 48183659     0.00     0.00  __gnu_cxx::__normal_iterator<unsigned char const*, std::vector<unsigned char, std::allocator<unsigned char> > >::operator*() const
  0.95      2.04     0.02                             bit_array_c::operator()(unsigned int)
  0.47      2.06     0.01  6025316     0.00     0.00  __gnu_cxx::__normal_iterator<unsigned char*, std::vector<unsigned char, std::allocator<unsigned char> > >::__normal_iterator(unsigned char* const&)
  0.47      2.06     0.01  3012657     0.00     0.00  __gnu_cxx::__normal_iterator<unsigned char*, std::vector<unsigned char, std::allocator<unsigned char> > >::operator*() const
  0.47      2.08     0.01  1004222     0.00     0.00  std::vector<unsigned char, std::allocator<unsigned char> >::end() const
... remainder omitted ...

我对 C++ 的 STL 不是很熟悉，但是谁能解释为什么 std::vector::begin() 被调用了几百万次？而且，当然，我是否可以做些什么来加快速度？

编辑：我只是放弃并优化了搜索功能（循环）。

score 6 · Accepted Answer

您在配置文件输出中看到很多内联函数这一事实意味着它们没有被内联——也就是说，您没有在打开优化的情况下进行编译。因此，优化代码最简单的方法是使用 -O2 或 -O3。

分析未优化的代码很少值得，因为优化和未优化代码的执行配置文件可能完全不同。 33

score 2 · Accepted Answer

bitarray.cpp 代码的快速峰值显示：

bool bit_array_c::operator[](const unsigned int bit) const
{
    return((m_Array[BIT_CHAR(bit)] & BIT_IN_CHAR(bit)) != 0);
}

m_Array 的类型为 std::vector

STL 向量上的 [] 运算符具有恒定的复杂性，但它可能实现为对 vector::begin 的调用以获取数组的基地址，然后计算偏移量以获取所需的值。由于 bitarray.cpp 在 EVERY BIT ACCESS 上调用 [] 运算符，因此您接到了很多电话。

给定您的用例，我将创建 bitarray.cpp 中包含的功能的自定义实现，并针对您的顺序、逐位访问模式对其进行调整。

不要使用无符号字符，使用 32 或 64 位值来减少所需的内存访问次数。
我会使用普通数组，而不是向量来避免查找开销
创建一个不进行所有查找的顺序访问函数 nextbit()。存储一个指向当前“值”的指针，您需要在 32/64 位边界上递增它，边界之间的所有访问都是简单的掩码/移位操作，应该非常快。

score 1 · Accepted Answer

如果没有看到您的代码，很难就如何加快您的工作速度做出具体评论。但是，vector::begin()它用于将迭代器返回到向量中的第一个元素 - 它是遍历向量时的标准例程。

我实际上建议使用更现代的分析器，例如OProfile，这将为您提供有关程序花费时间的更细粒度的信息 - 到实际的 C++ 行，甚至是单独的 asm 指令，具体取决于您的方式运行。

顺便说一句- 你为什么选择使用bitarray.cpp而不是 vanilla std::vector<bool>？我自己没有使用过它，但是快速浏览一下上面的链接表明它bitarray.cpp支持额外的功能std::vector<bool>，如果你不使用它可能会增加与 STL 向量类相比的开销......

score 0 · Accepted Answer

您可以通过使用指针/迭代器来提高性能（我不确定 bitarray.cpp 到底为您做了什么），如下所示：

for (bool *ptr = myBitArray, int i = 0; i != imgLength; ++i, ++ptr)
{
   if (*myBitArray == 1)
   {
       //handle
   }
}

我只在这里使用 int i 因为我不确定您的位数组是否会以空值终止，在这种情况下您的条件可能只是

*myBitArray != '\0';

或者你可以破解一个更好的结束条件。最好使用 std::iterator ，但我怀疑你的 bitarray 会支持它。

编辑：

通常这将是一个微优化，但如果你循环了足够多的东西，它可能会稍微提高性能。

score 0 · Accepted Answer

如果性能足够重要以至于您不得不担心访问单个位，那么您可能应该并行化您的代码。由于您将其描述为图像处理，因此位 i 的状态很可能不会影响您处理位 i+1 到 i+6 的方式，因此您可能可以重写代码以一次对字节和字进行操作。仅仅能够将计数器增加 8 到 64 次的频率应该会提供可衡量的性能提升，并且还会使编译器更容易优化您的代码。

c++ - 优化位数组访问

5 回答 5

Related

Reference