I have two shards, each backed by a 3-member replica set on identically specced machines.
The chunks are distributed reasonably evenly:
Shard events at events/xxx:27018,yyy:27018
 data : 6.82GiB docs : 532402 chunks : 59
 estimated data per chunk : 118.42MiB
 estimated docs per chunk : 9023

Shard events2 at events2/zzz:27018,qqq:27018
 data : 7.3GiB docs : 618783 chunks : 66
 estimated data per chunk : 113.31MiB
 estimated docs per chunk : 9375

Totals
 data : 14.12GiB docs : 1151185 chunks : 125
 Shard events contains 48.29% data, 46.24% docs in cluster, avg obj size on shard : 13KiB
 Shard events2 contains 51.7% data, 53.75% docs in cluster, avg obj size on shard : 12KiB
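(For reference, output of this shape can be reproduced from a mongos shell roughly like this; a sketch, with the database/collection names taken from the stats further down:)

// connect to a mongos, then:
use site_events
db.listen.getShardDistribution()   // per-shard data/docs/chunks summary, as pasted above
sh.status()                        // chunk ranges per shard and balancer state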
However, the primary on one shard has a vmsize almost 4x larger, a lock percentage close to 90% (versus 2% on the other), and higher btree counts. This causes a lot of cursor timeouts on that machine.
Both shards should be receiving similar kinds of queries, and the opcounter values are very close.
How do I diagnose this?
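A sketch of where I would start comparing the two shards, assuming direct shell access to each shard's primary (the hostnames above are placeholders):

// run on each shard primary (e.g. xxx:27018 and zzz:27018) and compare the output
use site_events
db.listen.stats()              // storageSize, paddingFactor, nindexes, per-index sizes
db.listen.getIndexes()         // are the same indexes defined on both shards?
db.serverStatus().opcounters   // query/insert/update/delete mix
db.serverStatus().globalLock   // lock statistics (relevant on MMAPv1-era builds)
db.currentOp()                 // long-running or unindexed operations in flight
db.listen.validate(true)       // full validation: extent/btree detail (slow, takes locks)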
Update: the poorly performing side appears to be using far more storage, including roughly 100x the index space:
"ns" : "site_events.listen",
"count" : 544213,
"size" : 7500665112,
"avgObjSize" : 13782.59084586366,
"storageSize" : 9698657792,
"numExtents" : 34,
"nindexes" : 3,
"lastExtentSize" : 1788297216,
"paddingFactor" : 1.0009999991378065,
"systemFlags" : 1,
"userFlags" : 1,
"totalIndexSize" : 4630807488,
"indexSizes" : {
"_id_" : 26845184,
"uid_1" : 26664960,
"list.i_1" : 4577297344
},
versus
"ns" : "site_events.listen",
"count" : 621962,
"size" : 7891599264,
"avgObjSize" : 12688.233789202555,
"storageSize" : 9305386992,
"numExtents" : 24,
"nindexes" : 2,
"lastExtentSize" : 2146426864,
"paddingFactor" : 1.0000000000917226,
"systemFlags" : 1,
"userFlags" : 1,
"totalIndexSize" : 45368624,
"indexSizes" : {
"_id_" : 22173312,
"uid_1" : 23195312
},
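The stats above show the slow shard has nindexes : 3, with a list.i_1 index of about 4.5 GB accounting for nearly all of its totalIndexSize, while the other shard has only _id_ and uid_1. A sketch of how I would confirm and act on that, assuming the index difference is indeed unintended (commands shown commented out because they modify data or block the node):

// on each shard primary, confirm which indexes actually exist
use site_events
db.listen.getIndexes()

// if list.i_1 is not wanted, drop it via mongos so the drop applies cluster-wide:
// db.listen.dropIndex("list.i_1")

// if it is wanted, it should exist on both shards; rebuilding the bloated copy
// and reclaiming space on that shard may also help:
// db.listen.reIndex()                    // rebuilds all indexes on that node
// db.runCommand({ compact: "listen" })   // reclaims space on MMAPv1, blocks the node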