我一直在努力优化一种专有算法,该算法根据内部索引标准对字符串代码向量进行排序。代码的长度范围为 1 - 32 个字符。该算法执行 std::swap 以将字符串移动到其新位置。
std::vector<std::string> temp_container(sorting_size+1,"");
for(auto &index_seed : all_seeds)
{
for(int i=sorting_size;i>=0;--i)
{
size_t index = // new index based on seed
std::swap(temp_container[index],original[i]);
}
original.swap(temp_container);
}
std::swap 在 std::string 上重载,应该对底层字符串执行移动,但分析似乎显示交换正在执行复制而不是移动,perf 的报告显示
- 56.95% std::swap<char, std::char_traits<char>, std::allocator<char> > ▒
- 51.14% std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::swap ▒
- 17.71% std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_is_local ▒
- 8.77% std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_local_data ▒
- 7.49% std::pointer_traits<char const*>::pointer_to ▒
- 4.41% std::addressof<char const> ▒
1.83% std::__addressof<char const> ▒
0.94% std::__addressof<char const> ▒
3.48% std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_data ▒
2.12% std::pointer_traits<char const*>::pointer_to ▒
+ 4.99% std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_set_length ▒
+ 3.76% __gnu_cxx::__alloc_traits<std::allocator<char>, char>::_S_on_swap ▒
3.62% std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::length ▒
3.10% std::char_traits<char>::copy ▒
3.09% __memmove_avx_unaligned_erms ▒
2.59% std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_local_data ▒
1.48% std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_length ▒
1.01% std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_get_allocator ▒
0.58% std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_data ▒
1.21% std::char_traits<char>::copy ▒
0.90% std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_set_length ▒
0.90% std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::length ▒
0.70% std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_is_local ▒
0.69% __gnu_cxx::__alloc_traits<std::allocator<char>, char>::_S_on_swap ▒
0.67% std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_get_allocator ▒
0.53% std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_length
我尝试使用 std::string 交换方法并强制移动,即
// tried this
temp_container[index].swap(original[i]);
// and this
temp_container[index] = std::move(original[i]);
但他们没有任何区别,基准测试始终显示该算法需要 2 秒以上才能对大约 400k 代码字符串进行排序。
更改向量以使用 string_views 最初会产生更好的结果
std::vector<std::string_view> temp_container(sorting_size+1,"");
for(auto &index_seed : all_seeds)
{
for(int i=sorting_size;i>=0;--i)
{
size_t index = // new index based on seed
// original is now also vector of string_view
std::swap(temp_container[index],original[i]);
}
original.swap(temp_container);
}
基准测试显示 string_view 版本在 1 秒内执行相同的排序。
但是,当我尝试使用优化 -O{1,2 或 3} 编译算法时,string_view 版本比 std::string 版本慢两倍。
检查缓存未命中
perf stat -e task-clock,cycles,instructions,cache-references,cache-misses ./CodeSortingAlgo
对于 std::string 非优化版本:
2,756.98 msec task-clock # 1.147 CPUs utilized
8,024,349,488 cycles # 2.911 GHz
13,979,212,612 instructions # 1.74 insn per cycle
3,560,486 cache-references # 1.291 M/sec
1,881,425 cache-misses # 52.842 % of all cache refs
2.404464923 seconds time elapsed
2.734367000 seconds user
0.024020000 seconds sys
对于 string_view 版本:
1,266.53 msec task-clock # 1.354 CPUs utilized
3,586,135,363 cycles # 2.831 GHz
4,780,766,035 instructions # 1.33 insn per cycle
7,747,467 cache-references # 6.117 M/sec
6,172,017 cache-misses # 79.665 % of all cache refs
0.935125202 seconds time elapsed
1.243645000 seconds user
0.023916000 seconds sys
对使用 -O2 std::string 版本编译的 algo 版本运行相同的操作:
281.92 msec task-clock # 1.214 CPUs utilized
794,130,273 cycles # 2.817 GHz
1,166,108,846 instructions # 1.47 insn per cycle
18,772,676 cache-references # 66.589 M/sec
3,797,519 cache-misses # 20.229 % of all cache refs
0.232186807 seconds time elapsed
0.258991000 seconds user
0.023906000 seconds sys
string_view 版本
393.30 msec task-clock # 1.137 CPUs utilized
1,124,532,667 cycles # 2.859 GHz
609,130,643 instructions # 0.54 insn per cycle
12,675,992 cache-references # 32.230 M/sec
6,753,795 cache-misses # 53.280 % of all cache refs
0.345989809 seconds time elapsed
0.366555000 seconds user
0.027590000 seconds sys
为什么 std::swap 和 std::string::swap 复制而不是移动,我错过了什么吗?我读错了性能报告吗?
为什么 std::string_view 的缓存性能这么差?是因为不是交换 len 和 ptr 而是跟随 ptr 然后交换?
为什么优化器无法将 string_view 版本优化到与字符串版本相同的级别。
编译器是 gcc 版本 8.3.0