go - 为什么在多 goroutine 中读写 slice 的成本很高？

Question

我写了一个多协程版本的mergeSort by go，我也写了一个基准测试。现在我想使用“go tool pprof”来分析我的代码的瓶颈。

当我得到 cpu 配置文件时，我在 pprof 中使用“top10”来获得以下输出：

Showing nodes accounting for 4.73s, 98.54% of 4.80s total
Dropped 21 nodes (cum <= 0.02s)
Showing top 10 nodes out of 30
      flat  flat%   sum%        cum   cum%
     3.66s 76.25% 76.25%      3.66s 76.25%  pingcap/talentplan/tidb/common/alg/sort.Merge
     0.62s 12.92% 89.17%      0.64s 13.33%  pingcap/talentplan/tidb/mergesort.prepare
     0.17s  3.54% 92.71%      0.17s  3.54%  runtime.freedefer
     0.12s  2.50% 95.21%      0.14s  2.92%  pingcap/talentplan/tidb/common/alg/sort.quickSort
     0.10s  2.08% 97.29%      0.10s  2.08%  runtime.memmove
     0.03s  0.62% 97.92%      0.03s  0.62%  runtime.memclrNoHeapPointers
     0.03s  0.62% 98.54%      0.04s  0.83%  runtime.stackpoolalloc
         0     0% 98.54%      0.11s  2.29%  pingcap/talentplan/tidb/common/alg/sort.MergeSortByMultiGoroutine
         0     0% 98.54%      0.14s  2.92%  pingcap/talentplan/tidb/common/alg/sort.QuickSort
         0     0% 98.54%      4.04s 84.17%  pingcap/talentplan/tidb/common/alg/sort.mergeSortByMultiGoroutine

从上面，我认为奇怪的瓶颈在sort.Merge，所以我使用“list Merge”来深入研究方法，我发现了以下信息：

     .          .     50:func Merge(arr []int64, start int, mid int, end int, tmp []int64)  {
     .          .     51:   index, i, j := start, start, mid + 1
  80ms       80ms     52:   for ; i <= mid && j <= end; index++ {
 180ms      180ms     53:           if arr[i] <= arr[j] {
 1.58s      1.58s     54:                   tmp[index] = arr[i]
  50ms       50ms     55:                   i++
     .          .     56:           } else {
 1.52s      1.52s     57:                   tmp[index] = arr[j]
  20ms       20ms     58:                   j++
     .          .     59:           }
     .          .     60:   }
     .          .     61:   for ; i <= mid; index++ {
     .          .     62:           tmp[index] = arr[i]
     .          .     63:           i++
     .          .     64:   }
     .          .     65:   for ; j <= end; index++ {
     .          .     66:           tmp[index] = arr[j]
     .          .     67:           j++
     .          .     68:   }
     .          .     69:
  60ms       60ms     70:   for i := start; i <= end; i++ {
 170ms      170ms     71:           arr[i] = tmp[i]
     .          .     72:   }
     .          .     73:}

让我感到困惑的是这里！在 Merge 方法中，有 4 个 for 循环。第一个 for 循环和第四个 for 循环的规模大致相同，它们的任务都是将元素从一个切片移动到另一个切片。问题是为什么第一个 for 循环花费这么多（1.58s 加 1.52s），而第四个 for 循环花费这么少（只有 170ms）？是反直觉的！

本项目的 github 地址为https://github.com/Duncan15/talent-plan/tree/master/tidb/mergesort。您可以使用“make pprof”运行基准测试并获取 cpu 配置文件和内存配置文件。

我想知道为什么会这样，如果你有时间，请阅读我的代码并给我一些建议。

谢谢你告诉我！！！

我已经编写了一些代码来验证当 Merge 方法在单 goroutine 环境中运行时，第一个 for 循环的成本与第四个 for 循环的成本大致相同，这看起来很直观。所以我想是不是多goroutine环境造成了上述现象。但是在多协程环境中，Merge 方法是并发运行的。也就是说，第1个for循环和第4个for循环同时运行，如果同时读写slice会增加开销，那么第4个for循环的开销也会增加，但是从pprof的输出中我们可以发现只有第1个for-loop的成本增加！

我还写了另一个测试来验证我的想法。您可以使用“make vet”来运行它。此测试同时运行 Merge 方法。与多协程版本的mergeSort不同的是，这个测试没有排序的代码，只有合并的代码。而且我惊讶地发现，在这个测试中，第一个 for 循环的成本与第四个 for 循环的成本大致相同。所以最后我完全糊涂了！

go - 为什么在多 goroutine 中读写 slice 的成本很高？

0 回答 0

Related

Reference