3

我一直在努力优化一种专有算法,该算法根据内部索引标准对字符串代码向量进行排序。代码的长度范围为 1 - 32 个字符。该算法执行 std::swap 以将字符串移动到其新位置。

std::vector<std::string> temp_container(sorting_size+1,"");
for(auto &index_seed : all_seeds)
{

    for(int i=sorting_size;i>=0;--i)
    {
        size_t index = // new index based on seed

        std::swap(temp_container[index],original[i]);
    }

    original.swap(temp_container);
}

std::swap 在 std::string 上重载,应该对底层字符串执行移动,但分析似乎显示交换正在执行复制而不是移动,perf 的报告显示

   - 56.95% std::swap<char, std::char_traits<char>, std::allocator<char> >                                                                                                                                                   ▒
      - 51.14% std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::swap                                                                                                                         ▒
         - 17.71% std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_is_local                                                                                                               ▒
            - 8.77% std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_local_data                                                                                                           ▒
               - 7.49% std::pointer_traits<char const*>::pointer_to                                                                                                                                                          ▒
                  - 4.41% std::addressof<char const>                                                                                                                                                                         ▒
                       1.83% std::__addressof<char const>                                                                                                                                                                    ▒
                    0.94% std::__addressof<char const>                                                                                                                                                                       ▒
              3.48% std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_data                                                                                                                 ▒
              2.12% std::pointer_traits<char const*>::pointer_to                                                                                                                                                             ▒
         + 4.99% std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_set_length                                                                                                              ▒
         + 3.76% __gnu_cxx::__alloc_traits<std::allocator<char>, char>::_S_on_swap                                                                                                                                           ▒
           3.62% std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::length                                                                                                                     ▒
           3.10% std::char_traits<char>::copy                                                                                                                                                                                ▒
           3.09% __memmove_avx_unaligned_erms                                                                                                                                                                                ▒
           2.59% std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_local_data                                                                                                              ▒
           1.48% std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_length                                                                                                                  ▒
           1.01% std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_get_allocator                                                                                                           ▒
           0.58% std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_data                                                                                                                    ▒
        1.21% std::char_traits<char>::copy                                                                                                                                                                                   ▒
        0.90% std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_set_length                                                                                                                 ▒
        0.90% std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::length                                                                                                                        ▒
        0.70% std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_is_local                                                                                                                   ▒
        0.69% __gnu_cxx::__alloc_traits<std::allocator<char>, char>::_S_on_swap                                                                                                                                              ▒
        0.67% std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_get_allocator                                                                                                              ▒
        0.53% std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_length

我尝试使用 std::string 交换方法并强制移动,即

// tried this
temp_container[index].swap(original[i]);
// and this
temp_container[index] = std::move(original[i]);

但他们没有任何区别,基准测试始终显示该算法需要 2 秒以上才能对大约 400k 代码字符串进行排序。

更改向量以使用 string_views 最初会产生更好的结果

std::vector<std::string_view> temp_container(sorting_size+1,"");
for(auto &index_seed : all_seeds)
{

    for(int i=sorting_size;i>=0;--i)
    {
        size_t index = // new index based on seed
        // original is now also vector of string_view
        std::swap(temp_container[index],original[i]);
    }

    original.swap(temp_container);
}

基准测试显示 string_view 版本在 1 秒内执行相同的排序。

但是,当我尝试使用优化 -O{1,2 或 3} 编译算法时,string_view 版本比 std::string 版本慢两倍。

检查缓存未命中

perf stat -e task-clock,cycles,instructions,cache-references,cache-misses ./CodeSortingAlgo

对于 std::string 非优化版本:

          2,756.98 msec task-clock                #    1.147 CPUs utilized          
     8,024,349,488      cycles                    #    2.911 GHz                    
    13,979,212,612      instructions              #    1.74  insn per cycle         
         3,560,486      cache-references          #    1.291 M/sec                  
         1,881,425      cache-misses              #   52.842 % of all cache refs    

       2.404464923 seconds time elapsed

       2.734367000 seconds user
       0.024020000 seconds sys

对于 string_view 版本:

          1,266.53 msec task-clock                #    1.354 CPUs utilized          
     3,586,135,363      cycles                    #    2.831 GHz                    
     4,780,766,035      instructions              #    1.33  insn per cycle         
         7,747,467      cache-references          #    6.117 M/sec                  
         6,172,017      cache-misses              #   79.665 % of all cache refs    

       0.935125202 seconds time elapsed

       1.243645000 seconds user
       0.023916000 seconds sys

对使用 -O2 std::string 版本编译的 algo 版本运行相同的操作:

            281.92 msec task-clock                #    1.214 CPUs utilized          
       794,130,273      cycles                    #    2.817 GHz                    
     1,166,108,846      instructions              #    1.47  insn per cycle         
        18,772,676      cache-references          #   66.589 M/sec                  
         3,797,519      cache-misses              #   20.229 % of all cache refs    

       0.232186807 seconds time elapsed

       0.258991000 seconds user
       0.023906000 seconds sys

string_view 版本

            393.30 msec task-clock                #    1.137 CPUs utilized          
     1,124,532,667      cycles                    #    2.859 GHz                    
       609,130,643      instructions              #    0.54  insn per cycle         
        12,675,992      cache-references          #   32.230 M/sec                  
         6,753,795      cache-misses              #   53.280 % of all cache refs    

       0.345989809 seconds time elapsed

       0.366555000 seconds user
       0.027590000 seconds sys

为什么 std::swap 和 std::string::swap 复制而不是移动,我错过了什么吗?我读错了性能报告吗?

为什么 std::string_view 的缓存性能这么差?是因为不是交换 len 和 ptr 而是跟随 ptr 然后交换?

为什么优化器无法将 string_view 版本优化到与字符串版本相同的级别。

编译器是 gcc 版本 8.3.0

4

0 回答 0