c++ - (Lack of) performance improvement using C++11 move semantics

Question

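I have been writing C++11 code for quite some time without benchmarking any of it, simply expecting things like vector operations to be "just faster" now thanks to move semantics. When I actually benchmarked with GCC 4.7.2 and clang 3.0 (the default compilers on Ubuntu 12.10 64-bit), the results were very unsatisfying. This is my test code: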

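EDIT: In response to the (good) answers posted by @DeadMG and @ronag, I changed the element type from std::string to my::string, which has no swap(), and made all inner strings larger (200-700 bytes) so that they should not be victims of SSO.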

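EDIT2: COW was the reason. Following the great comments, I adapted the code again, changed the storage from std::string to std::vector<char>, and left out the copy/move constructors (letting the compiler generate them instead). Without COW, the speed difference is actually huge.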

EDIT3: Compile with -DCOW to make the internal storage a std::string instead of a std::vector<char>, as requested by @chico.

#include <string>
#include <vector>
#include <fstream>
#include <iostream>
#include <algorithm>
#include <functional>
#include <utility>   // std::move

static std::size_t dec = 0;

namespace my { class string
{
public:
    string( ) { }
#ifdef COW
    string( const std::string& ref ) : str( ref ), val( dec % 2 ? - ++dec : ++dec ) {
#else
    string( const std::string& ref ) : val( dec % 2 ? - ++dec : ++dec ) {
        str.resize( ref.size( ) );
        std::copy( ref.begin( ), ref.end( ), str.begin( ) );
#endif
    }

    bool operator<( const string& other ) const { return val < other.val; }

private:
#ifdef COW
    std::string str;
#else
    std::vector< char > str;
#endif
    std::size_t val;
}; }


template< typename T >
void dup_vector( T& vec )
{
    T v = vec;
    for ( typename T::iterator i = v.begin( ); i != v.end( ); ++i )
#ifdef CPP11
        vec.push_back( std::move( *i ) );
#else
        vec.push_back( *i );
#endif
}

int main( )
{
    std::ifstream file;
    file.open( "/etc/passwd" );
    std::vector< my::string > lines;
    while ( ! file.eof( ) )
    {
        std::string s;
        std::getline( file, s );
        lines.push_back( s + s + s + s + s + s + s + s + s );
    }

    while ( lines.size( ) < ( 1000 * 1000 ) )
        dup_vector( lines );
    std::cout << lines.size( ) << " elements" << std::endl;

    std::sort( lines.begin( ), lines.end( ) );

    return 0;
}

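This reads /etc/passwd into a vector of lines and then duplicates that vector onto itself over and over until we have at least 1 million entries. This is the first place where the optimization should help: not only the explicit std::move() you see in dup_vector(), but push_back itself should also perform better when it has to resize the internal array (allocate a new one and copy the elements over).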
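To make the push_back reasoning above concrete, here is a minimal sketch that counts copies and moves while a vector grows. The `Counted` type and the `append_n` helper are hypothetical names introduced for illustration and are not part of the benchmark. The sketch assumes a C++11 compiler and marks the move constructor `noexcept`, since std::vector generally falls back to copying during reallocation when a move could throw.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper type (not from the question): counts how often it is
// copied or moved. The move constructor is noexcept so that std::vector is
// allowed to move elements during reallocation instead of copying them.
struct Counted
{
    static std::size_t copies;
    static std::size_t moves;

    std::vector< char > payload;

    Counted( ) : payload( 256, 'x' ) { }
    Counted( const Counted& other ) : payload( other.payload ) { ++copies; }
    Counted( Counted&& other ) noexcept : payload( std::move( other.payload ) ) { ++moves; }
};

std::size_t Counted::copies = 0;
std::size_t Counted::moves = 0;

// Appends n elements without reserving, so the vector has to grow several
// times. With use_move == false every new element is copied in, which
// mirrors the push_back( *i ) branch of the benchmark.
void append_n( std::vector< Counted >& vec, std::size_t n, bool use_move )
{
    for ( std::size_t i = 0; i < n; ++i )
    {
        Counted c;
        if ( use_move )
            vec.push_back( std::move( c ) );
        else
            vec.push_back( c );
    }
}
```

After `append_n( vec, 1000, true )` the copy counter should still be zero, while the `false` variant performs one copy per appended element. In both cases the reallocations themselves use moves, because the move constructor is `noexcept`.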

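Finally, the vector gets sorted. This should definitely be faster when you don't need to copy temporary objects each time two elements are swapped.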
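The expectation about sorting can be checked in the same spirit. The following is only a sketch: `Tracked` and `build_and_sort` are hypothetical names, and the element type simply pairs a sort key with a heap-allocated payload so that a copy would be expensive. Compiled as C++11, std::sort is expected to rearrange such elements through move construction and move assignment only.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical element type: a sort key plus a heap-allocated payload,
// with counters for copy and move operations.
struct Tracked
{
    static std::size_t copies;
    static std::size_t moves;

    std::size_t key;
    std::vector< char > payload;

    explicit Tracked( std::size_t k ) : key( k ), payload( 256, 'x' ) { }

    Tracked( const Tracked& o ) : key( o.key ), payload( o.payload ) { ++copies; }
    Tracked( Tracked&& o ) noexcept : key( o.key ), payload( std::move( o.payload ) ) { ++moves; }

    Tracked& operator=( const Tracked& o )
    {
        key = o.key;
        payload = o.payload;
        ++copies;
        return *this;
    }

    Tracked& operator=( Tracked&& o ) noexcept
    {
        key = o.key;
        payload = std::move( o.payload );
        ++moves;
        return *this;
    }

    bool operator<( const Tracked& other ) const { return key < other.key; }
};

std::size_t Tracked::copies = 0;
std::size_t Tracked::moves = 0;

// Builds n elements in descending key order, resets the counters and sorts,
// so the counters afterwards reflect the work done by std::sort alone.
std::vector< Tracked > build_and_sort( std::size_t n )
{
    std::vector< Tracked > v;
    v.reserve( n );
    for ( std::size_t i = 0; i < n; ++i )
        v.emplace_back( n - i );

    Tracked::copies = 0;
    Tracked::moves = 0;
    std::sort( v.begin( ), v.end( ) );
    return v;
}
```

Reading `Tracked::copies` and `Tracked::moves` after `build_and_sort( 1000 )` shows how the work splits. A type whose swap or copy is already cheap (for example a COW string) would see little benefit from this, which is consistent with the first set of measurements.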

I compiled and ran this in two ways, once as C++98 and once as C++11 (with -DCPP11 to enable the explicit move):

1> $ rm -f a.out ; g++ --std=c++98 test.cpp ; time ./a.out
2> $ rm -f a.out ; g++ --std=c++11 -DCPP11 test.cpp ; time ./a.out
3> $ rm -f a.out ; clang++ --std=c++98 test.cpp ; time ./a.out
4> $ rm -f a.out ; clang++ --std=c++11 -DCPP11 test.cpp ; time ./a.out

With the following results (each configuration run twice):

GCC C++98
1> real 0m9.626s
1> real 0m9.709s

GCC C++11
2> real 0m10.163s
2> real 0m10.130s

So it actually runs slightly slower when compiled as C++11 code. Similar results hold for clang:

clang C++98
3> real 0m8.906s
3> real 0m8.750s

clang C++11
4> real 0m8.858s
4> real 0m9.053s

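Can someone tell me why this is? Are the compilers optimizing so well, even when compiling for pre-C++11, that they effectively achieve move-semantics behaviour anyway? If I add -O2, all the code runs faster, but the results between the two standards are almost the same as above.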

EDIT: New results using my::string instead of std::string, and larger individual strings:

$ rm -f a.out ; g++ --std=c++98 test.cpp ; time ./a.out
real    0m16.637s
$ rm -f a.out ; g++ --std=c++11 -DCPP11 test.cpp ; time ./a.out
real    0m17.169s
$ rm -f a.out ; clang++ --std=c++98 test.cpp ; time ./a.out
real    0m16.222s
$ rm -f a.out ; clang++ --std=c++11 -DCPP11 test.cpp ; time ./a.out
real    0m15.652s

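The difference between C++98 and C++11 with move semantics is very small. C++11 is slightly slower with GCC and slightly faster with clang, but the differences are still minor.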

EDIT2: Now, without std::string's COW, the performance improvement is huge:

$ rm -f a.out ; g++ --std=c++98 test.cpp ; time ./a.out
real    0m10.313s
$ rm -f a.out ; g++ --std=c++11 -DCPP11 test.cpp ; time ./a.out
real    0m5.267s
$ rm -f a.out ; clang++ --std=c++98 test.cpp ; time ./a.out
real    0m10.218s
$ rm -f a.out ; clang++ --std=c++11 -DCPP11 test.cpp ; time ./a.out
real    0m3.376s

With optimization enabled, the difference is large as well:

$ rm -f a.out ; g++ -O2 --std=c++98 test.cpp ; time ./a.out
real    0m5.243s
$ rm -f a.out ; g++ -O2 --std=c++11 -DCPP11 test.cpp ; time ./a.out
real    0m0.803s
$ rm -f a.out ; clang++ -O2 --std=c++98 test.cpp ; time ./a.out
real    0m5.248s
$ rm -f a.out ; clang++ -O2 --std=c++11 -DCPP11 test.cpp ; time ./a.out
real    0m0.785s

The above shows a speedup of roughly 6-7x with C++11.

Thanks for the great comments and answers. I hope this post will be useful and interesting for others too.

score 15 · Accepted Answer

"This should definitely be faster when you don't need to copy temporary objects each time two elements are swapped."

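std::string has a swap member, so sort will already use it, and its internal implementation is effectively move semantics already. You won't see a difference between copy and move for std::string as long as SSO is involved. In addition, some versions of GCC still ship a COW-based implementation, which C++11 does not permit, and which also would not show much difference between copy and move.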
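As a rough illustration of why a swap member already behaves like a move, the sketch below checks that swapping two containers only exchanges their heap buffers. It uses std::vector<char> as a stand-in, because what std::string does internally depends on SSO and COW; `swap_exchanges_buffers` is a hypothetical helper name.

```cpp
#include <vector>

// Returns true if swapping the two vectors merely exchanged their heap
// buffers, that is, no element was copied. Hypothetical helper, for
// illustration only.
bool swap_exchanges_buffers( std::vector< char >& a, std::vector< char >& b )
{
    const char* a_before = a.data( );
    const char* b_before = b.data( );

    a.swap( b );

    // After the swap each vector owns the buffer the other one had before.
    return a.data( ) == b_before && b.data( ) == a_before;
}
```

For two non-empty vectors this is expected to return true regardless of their sizes, which is why a sort built on swap gains little from additional move support.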

score 2 · Accepted Answer

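It's probably due to the small string optimization, which can kick in (depending on the compiler) for strings shorter than, for example, 16 characters. I'm guessing that all the lines in the file are quite short, since they are passwords.

When the small string optimization is active for a particular string, a move is done as a copy.

You will need larger strings to see any speed improvement from move semantics.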
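One way to see whether a given string is subject to the small string optimization is to check whether its character buffer lives inside the std::string object itself. This is a sketch only: `stored_inline` is a hypothetical helper, and the size threshold, or whether SSO exists at all, depends on the standard library in use.

```cpp
#include <functional>
#include <string>

// Returns true if the character buffer of s lies inside the std::string
// object itself, which is how small string optimization typically shows up.
// std::less is used because comparing unrelated pointers with < is not
// guaranteed to give a meaningful result.
bool stored_inline( const std::string& s )
{
    const char* obj_begin = reinterpret_cast< const char* >( &s );
    const char* obj_end = obj_begin + sizeof( std::string );

    std::less< const char* > before;
    return !before( s.data( ), obj_begin ) && before( s.data( ), obj_end );
}
```

A string of several hundred characters cannot fit inside the object, so `stored_inline` returns false for it and a move can simply hand over the heap pointer. For a short string such as "abc", an SSO implementation would be expected to return true, and moving it then has to copy the characters.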

score 2 · Accepted Answer

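I think you need to profile the program. Maybe most of the time is spent in the line T v = vec; and in the std::sort(..) of a vector of 20 million strings!!! Nothing to do with move semantics.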
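Short of a real profiler, a first approximation is to time the phases separately. The helper below is a minimal sketch under that assumption: `elapsed_ms` is a hypothetical name, and it needs C++11 for std::chrono.

```cpp
#include <chrono>

// Runs f once and returns the elapsed time in milliseconds, measured with a
// monotonic clock. Hypothetical helper; a real profiler gives a far more
// detailed picture.
template< typename F >
double elapsed_ms( F f )
{
    const std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now( );
    f( );
    const std::chrono::steady_clock::time_point stop = std::chrono::steady_clock::now( );
    return std::chrono::duration< double, std::milli >( stop - start ).count( );
}
```

In the benchmark, the duplication loop and the sort could each be wrapped in a lambda, for example `elapsed_ms( [&] { std::sort( lines.begin( ), lines.end( ) ); } )`, to see which phase dominates the total run time.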
