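We have set up a three-member replica set on MongoDB 3.4 consisting of:
- Primary. Physical on-premises server, Windows Server 2012, 64 GB RAM, 6 cores. Hosted in Scandinavia.
- Secondary. Amazon EC2, Windows Server 2016, m4.2xlarge, 32 GB RAM, 8 vCPUs. Hosted in Germany.
- Arbiter. Micro cloud-based Linux instance.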
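The problem we are seeing is that the secondary cannot keep up with the primary. When we seed it with data (copied from the primary) and add it to the replica set, it usually manages to get in sync, but an hour later it may lag by 10 minutes; a few hours later it is an hour behind, and so on, until a day or two later it goes stale.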
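For anyone wanting to track that lag over time, here is a minimal sketch that computes each secondary's lag from an object shaped like the output of rs.status(). The function name `replicationLagSeconds`, the arbiter host name, and the sample timestamps are made up for illustration; only the `members`, `stateStr`, `name` and `optimeDate` fields are taken from the real rs.status() output.

```javascript
// Sketch: compute how far each secondary is behind the primary, given an
// object shaped like the result of rs.status().
function replicationLagSeconds(status) {
  const primary = status.members.find((m) => m.stateStr === "PRIMARY");
  if (!primary) {
    throw new Error("no primary found in status");
  }
  return status.members
    .filter((m) => m.stateStr === "SECONDARY")
    .map((m) => ({
      name: m.name,
      // optimeDate is the wall-clock time of the last applied oplog entry.
      lagSeconds: (primary.optimeDate.getTime() - m.optimeDate.getTime()) / 1000,
    }));
}

// Hypothetical sample data: the secondary is 10 minutes behind.
const sampleStatus = {
  members: [
    { name: "primary.XXX.com:27017", stateStr: "PRIMARY", optimeDate: new Date("2017-03-27T14:41:53Z") },
    { name: "secondary.XXX.com:27017", stateStr: "SECONDARY", optimeDate: new Date("2017-03-27T14:31:53Z") },
    { name: "arbiter.XXX.com:27017", stateStr: "ARBITER" }, // hypothetical host name
  ],
};

const lagReport = replicationLagSeconds(sampleStatus);
console.log(lagReport);
```

In a mongo shell the same function could presumably be called as `replicationLagSeconds(rs.status())` at intervals to see how quickly the lag grows.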
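We are trying to figure out why this happens. The primary consistently uses 0-1% CPU, while the secondary is under constant heavy load at 20-80% CPU. This seems to be the only potential resource constraint. Disk and network load do not appear to be an issue. There seems to be some locking going on on the secondary, since operations in the mongo shell (such as db.getReplicationInfo()) frequently take 5 minutes or more to complete, and mongostat rarely works (it just says i/o timeout). Here is the output from mongostat on one of the rare occasions when it reported stats for the secondary: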
host                    insert query update delete getmore command dirty used  flushes vsize res   qrw arw  net_in net_out conn set repl time
localhost:27017         *0     33    743    *0     0       166|0   1.0%  78.7% 0       27.9G 27.0G 0|0 0|1  2.33m  337k    739  rs  PRI  Mar 27 14:41:54.578
primary.XXX.com:27017   *0     36    825    *0     0       131|0   1.0%  78.7% 0       27.9G 27.0G 0|0 0|0  1.73m  322k    739  rs  PRI  Mar 27 14:41:53.614
secondary.XXX.com:27017 *0     *0    *0     *0     0       109|0   4.3%  80.0% 0       8.69G 7.54G 0|0 0|10 6.69k  134k    592  rs  SEC  Mar 27 14:41:53.673
I ran db.serverStatus() on the secondary and compared it with the primary. One number that stood out was the following:
"locks" : {"Global" : {"timeAcquiringMicros" : {"r" : NumberLong("21188001783")
The secondary had an uptime of 14000 seconds at the time.
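To put that number in perspective, here is the arithmetic as a short sketch. It assumes, as I understand the counter, that timeAcquiringMicros is a cumulative wait time summed over all operations, which is why it can exceed the uptime.

```javascript
// Values quoted above from db.serverStatus() on the secondary.
const timeAcquiringMicros = 21188001783; // locks.Global.timeAcquiringMicros.r
const uptimeSeconds = 14000;

// Convert microseconds to seconds.
const waitSeconds = timeAcquiringMicros / 1e6;

// Cumulative lock wait (summed over all operations) per second of uptime.
const waitPerUptimeSecond = waitSeconds / uptimeSeconds;

console.log(waitSeconds.toFixed(0));         // about 21188 seconds
console.log(waitPerUptimeSecond.toFixed(2)); // about 1.51
```

Under that assumption, roughly 1.5 operations on average were waiting for the global read lock at any given moment since startup.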
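Any ideas on what this could be, or how to debug this issue, would be greatly appreciated! We could upgrade the Amazon instance to something more powerful, but we have already done that three times, and at this point we figure that something else must be wrong.
I will include the output from db.currentOp() on the secondary below, in case it helps. (The command took 5 minutes to run, after which the following was logged: Restarting oplog query due to error: CursorNotFound: Cursor not found, cursor id: 15728290121. Last fetched optime (with hash): { ts: Timestamp 1490613628000|756, t: 48 }[-5363878314895774690]. Restarts remaining: 3)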
"desc":"conn605", "connectionId":605,"client":"127.0.0.1:61098", "appName":"MongoDB 外壳", "secs_running":0, "microsecs_running":NumberLong(16), “操作”:“命令”, "ns":"admin.$cmd", "查询":{"currentOp":1}, “锁”:{}, “等待锁定”:假, “lockStats”:{} "desc":"repl writer worker 10", "secs_running":0, "microsecs_running":NumberLong(14046), “操作”:“无”, "ns":"CustomerDB.ed2112ec779f", "锁":{"全局":"W","数据库":"W"}, “等待锁定”:假, "lockStats":{"Global":{"acquireCount":{"w":NumberLong(1),"W":NumberLong(1)}},"Database":{"acquireCount":{"W":NumberLong (1)}}} "desc":"ApplyBatchFinalizerForJournal", “操作”:“无”, "ns":"", “锁”:{}, “等待锁定”:假, “lockStats”:{} "desc":"ReplBatcher", “secs_running”:11545, "microsecs_running":NumberLong("11545663961"), “操作”:“无”, "ns":"local.oplog.rs", “锁”:{}, “等待锁定”:假, "lockStats":{"Global":{"acquireCount":{"r":NumberLong(2)}},"Database":{"acquireCount":{"r":NumberLong(1)}},"oplog" :{"acquireCount":{"r":NumberLong(1)}}} "desc":"rsBackgroundSync", “secs_running”:11545, "microsecs_running":NumberLong("11545281690"), “操作”:“无”, "ns":"local.replset.minvalid", “锁”:{}, “等待锁定”:假, "lockStats":{"Global":{"acquireCount":{"r":NumberLong(5),"w":NumberLong(1)}},"Database":{"acquireCount":{"r":NumberLong (2),"W":NumberLong(1)}},"Collection":{"acquireCount":{"r":NumberLong(2)}}} "desc":"TTL 监视器", “操作”:“无”, "ns":"", "锁":{"全局":"r"}, “等待锁定”:真, "lockStats":{"Global":{"acquireCount":{"r":NumberLong(35)},"acquireWaitCount":{"r":NumberLong(2)},"timeAcquiringMicros":{"r":NumberLong (341534123)}},"数据库":{"acquireCount":{"r":NumberLong(17)}},"Collection":{"acquireCount":{"r":NumberLong(17)}}} "desc":"SyncSourceFeedback", “操作”:“无”, "ns":"", “锁”:{}, “等待锁定”:假, “lockStats”:{} "desc":"WT RecordStoreThread: local.oplog.rs", “secs_running”:1163, "microsecs_running":NumberLong(1163137036), “操作”:“无”, "ns":"local.oplog.rs", “锁”:{}, “等待锁定”:假, "lockStats":{"Global":{"acquireCount":{"r":NumberLong(1),"w":NumberLong(1)}},"Database":{"acquireCount":{"w":NumberLong (1)}},"oplog":{"acquireCount":{"w":NumberLong(1)}}} 
"desc":"rsSync", “secs_running”:11545, "microsecs_running":NumberLong("11545663926"), “操作”:“无”, "ns":"local.replset.minvalid", “锁”:{“全球”:“W”}, “等待锁定”:假, "lockStats":{"Global":{"acquireCount":{"r":NumberLong(272095),"w":NumberLong(298255),"R":NumberLong(1),"W":NumberLong(74564) },"acquireWaitCount":{"W":NumberLong(3293)},"timeAcquiringMicros":{"W":NumberLong(17685)}},"Database":{"acquireCount":{"r":NumberLong(197529 ),"W":NumberLong(298255)},"acquireWaitCount":{"W":NumberLong(146)},"timeAcquiringMicros":{"W":NumberLong(651947)}},"Collection":{"acquireCount ":{"r":NumberLong(2)}}} "desc":"clientcursormon", "secs_running":0, "microsecs_running":NumberLong(15649), “操作”:“无”, "ns":"CustomerDB.b72ac80177ef", "锁":{"全局":"r"}, “等待锁定”:真, "lockStats":{"Global":{"acquireCount":{"r":NumberLong(387)},"acquireWaitCount":{"r":NumberLong(2)},"timeAcquiringMicros":{"r":NumberLong (397538606)}},"数据库":{"acquireCount":{"r":NumberLong(193)}},"Collection":{"acquireCount":{"r":NumberLong(193)}}}}] “好”:1}