We have a problem where some nodes in the cluster suddenly leave the cluster for no apparent reason.

We are running Elasticsearch v0.20.6 on JVM 7u25, with unicast discovery.

This is an embedded ES instance, with 7 nodes in one cluster: nodes 47, 48, 49 and 50 at one location (network), and nodes 24, 25 and 26 at another.

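The same thing happens every time after a while, even though the index files are deleted between tests. One of the nodes 24, 25 or 26 suddenly thinks it is the master. This in turn leads to a split-brain situation, which is fine (I understand why that happens), but the question is why the disconnect happens in the first place.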

At first, NODE47 is elected master. All the other nodes join, and things run smoothly for a few hours or so.

Then suddenly, at around 19:10, the first sign that something is clearly wrong appears:

Node47:
2013-08-14 19:09:49,243 DEBUG [org.elasticsearch.transport.netty] (elasticsearch[local][transport_client_worker][T#3]{New I/O worker #3}) [local] disconnected from [[local][VbxjXeqGRIyNFzvK-1JCIw][inet[/**NODE24**:8800]]{local=false}], channel closed event
2013-08-14 19:09:54,109 DEBUG [org.elasticsearch.transport.netty] (elasticsearch[local][transport_client_worker][T#3]{New I/O worker #3}) [local] disconnected from [[local][V7FXnZiLR-GVIyZ2DOwV2w][inet[/**NODE26**:8800]]{local=false}], channel closed event
2013-08-14 19:10:06,008 DEBUG [org.elasticsearch.transport.netty] (elasticsearch[local][transport_client_worker][T#4]{New I/O worker #4}) [local] disconnected from [[local][da-T28GDRtWgadrkCvxS-w][inet[/**NODE25**:8800]]{local=false}], channel closed event
2013-08-14 19:10:34,253 TRACE [org.elasticsearch.discovery.zen.fd] (elasticsearch[local][generic][T#19]) [local] [node  ] [[local][VbxjXeqGRIyNFzvK-1JCIw][inet[/**NODE24**:8800]]{local=false}] transport disconnected (with verified connect)
2013-08-14 19:10:34,259 DEBUG [org.elasticsearch.transport.netty] (elasticsearch[local][generic][T#24]) [local] connected to node [[local][V7FXnZiLR-GVIyZ2DOwV2w][inet[/**NODE26**:8800]]{local=false}]
2013-08-14 19:10:34,259 DEBUG [org.elasticsearch.transport.netty] (elasticsearch[local][generic][T#25]) [local] connected to node [[local][da-T28GDRtWgadrkCvxS-w][inet[/**NODE25**:8800]]{local=false}]
2013-08-14 19:10:34,273 DEBUG [org.elasticsearch.transport.netty] (elasticsearch[local][generic][T#26]) [local] connected to node [[local][VbxjXeqGRIyNFzvK-1JCIw][inet[/**NODE24**:8800]]{local=false}]
2013-08-14 19:10:34,290 DEBUG [org.elasticsearch.transport.netty] (elasticsearch[local][generic][T#27]) [local] disconnected from [[local][VbxjXeqGRIyNFzvK-1JCIw][inet[/**NODE24**:8800]]{local=false}]


Node24:
2013-08-14 19:10:35,167 DEBUG [org.elasticsearch.discovery.zen.fd] (elasticsearch[local][transport_client_worker][T#4]{New I/O worker #4}) [local] [master] pinging a master [local][Y01TgbUzRg-JIIpQ7NqlZg][inet[/**NODE47**:8800]]{local=false} but we do not exists on it, act as if its master failure
2013-08-14 19:10:35,170 DEBUG [org.elasticsearch.discovery.zen.fd] (elasticsearch[local][transport_client_worker][T#4]{New I/O worker #4}) [local] [master] stopping fault detection against master [[local][Y01TgbUzRg-JIIpQ7NqlZg][inet[/**NODE47**:8800]]{local=false}], reason [master failure, do not exists on master, act as master failure]
2013-08-14 19:10:35,171 INFO  [org.elasticsearch.discovery.zen] (elasticsearch[local][generic][T#1]) [local] master_left [[local][Y01TgbUzRg-JIIpQ7NqlZg][inet[/**NODE47**:8800]]{local=false}], reason [do not exists on master, act as master failure]
2013-08-14 19:10:35,174 DEBUG [org.elasticsearch.discovery.zen.fd] (elasticsearch[local][clusterService#updateTask][T#1]) [local] [master] restarting fault detection against master [[local][JrRrD5Y8R8WHn1ZAkjYNBw][inet[/**NODE45**:8800]]{local=false}], reason [possible elected master since master left (reason = do not exists on master, act as master failure)]
2013-08-14 19:10:35,181 DEBUG [org.elasticsearch.transport.netty] (elasticsearch[local][generic][T#1]) [local] disconnected from [[local][Y01TgbUzRg-JIIpQ7NqlZg][inet[/**NODE47**:8800]]{local=false}]
2013-08-14 19:10:36,233 DEBUG [org.elasticsearch.discovery.zen.fd] (elasticsearch[local][transport_client_worker][T#4]{New I/O worker #4}) [local] [master] pinging a master [local][JrRrD5Y8R8WHn1ZAkjYNBw][inet[/**NODE45**:8800]]{local=false} that is no longer a master
2013-08-14 19:10:36,235 INFO  [org.elasticsearch.discovery.zen] (elasticsearch[local][generic][T#5]) [local] master_left [[local][JrRrD5Y8R8WHn1ZAkjYNBw][inet[/**NODE45**:8800]]{local=false}], reason [no longer master]
2013-08-14 19:10:36,235 DEBUG [org.elasticsearch.discovery.zen.fd] (elasticsearch[local][transport_client_worker][T#4]{New I/O worker #4}) [local] [master] stopping fault detection against master [[local][JrRrD5Y8R8WHn1ZAkjYNBw][inet[/**NODE45**:8800]]{local=false}], reason [master failure, no longer master]
2013-08-14 19:10:36,241 DEBUG [org.elasticsearch.discovery.zen.fd] (elasticsearch[local][clusterService#updateTask][T#1]) [local] [master] restarting fault detection against master [[local][V7FXnZiLR-GVIyZ2DOwV2w][inet[/**NODE26**:8800]]{local=false}], reason [possible elected master since master left (reason = no longer master)]
2013-08-14 19:10:36,245 DEBUG [org.elasticsearch.transport.netty] (elasticsearch[local][generic][T#5]) [local] disconnected from [[local][JrRrD5Y8R8WHn1ZAkjYNBw][inet[/**NODE45**:8800]]{local=false}]
2013-08-14 19:10:37,359 DEBUG [org.elasticsearch.discovery.zen.fd] (elasticsearch[local][transport_client_worker][T#3]{New I/O worker #3}) [local] [master] pinging a master [local][V7FXnZiLR-GVIyZ2DOwV2w][inet[/**NODE26**:8800]]{local=false} that is no longer a master
2013-08-14 19:10:37,361 INFO  [org.elasticsearch.discovery.zen] (elasticsearch[local][generic][T#10]) [local] master_left [[local][V7FXnZiLR-GVIyZ2DOwV2w][inet[/**NODE26**:8800]]{local=false}], reason [no longer master]
2013-08-14 19:10:37,363 DEBUG [org.elasticsearch.discovery.zen.fd] (elasticsearch[local][transport_client_worker][T#3]{New I/O worker #3}) [local] [master] stopping fault detection against master [[local][V7FXnZiLR-GVIyZ2DOwV2w][inet[/**NODE26**:8800]]{local=false}], reason [master failure, no longer master]
2013-08-14 19:10:37,393 DEBUG [org.elasticsearch.transport.netty] (elasticsearch[local][generic][T#10]) [local] disconnected from [[local][V7FXnZiLR-GVIyZ2DOwV2w][inet[/**NODE26**:8800]]{local=false}]

As far as I can tell from the logs, this is what is happening:

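19:09:49,243 - NODE47 (the master) receives a channel closed event for NODE24, and NODE24 is disconnected
19:10:34,273 - a connect to NODE24 completes, and then
19:10:34,290 - we are disconnected from NODE24
19:10:35,167 - NODE24 pings the master (NODE47), but the master does not have NODE24 in its node list, and NODE24 treats this as a master failure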
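
To make the spacing between these events easier to see, here is a minimal Java sketch that parses the timestamp at the start of each log line and prints the gap to the previous line. The class and method names (`LogTimeline`, `timestampOf`, `gapsMillis`) are hypothetical, and the input lines in `main` are abbreviated stand-ins for the log entries quoted above, not the full entries.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Elasticsearch.
class LogTimeline {
    // Timestamp layout at the start of each line, e.g. "2013-08-14 19:09:49,243".
    static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss,SSS");

    // Parses the first 23 characters of a log line as its timestamp.
    static LocalDateTime timestampOf(String logLine) {
        return LocalDateTime.parse(logLine.substring(0, 23), FORMAT);
    }

    // Returns the gap in milliseconds between each line and the one before it.
    static List<Long> gapsMillis(List<String> logLines) {
        List<Long> gaps = new ArrayList<>();
        for (int i = 1; i < logLines.size(); i++) {
            LocalDateTime previous = timestampOf(logLines.get(i - 1));
            LocalDateTime current = timestampOf(logLines.get(i));
            gaps.add(Duration.between(previous, current).toMillis());
        }
        return gaps;
    }

    public static void main(String[] args) {
        // Abbreviated versions of the four events in the timeline above.
        List<String> lines = Arrays.asList(
                "2013-08-14 19:09:49,243 NODE47: disconnected from NODE24, channel closed event",
                "2013-08-14 19:10:34,273 NODE47: connected to node NODE24",
                "2013-08-14 19:10:34,290 NODE47: disconnected from NODE24",
                "2013-08-14 19:10:35,167 NODE24: pinging a master but we do not exists on it");
        List<Long> gaps = gapsMillis(lines);
        for (int i = 0; i < gaps.size(); i++) {
            System.out.println(gaps.get(i) + " ms before: " + lines.get(i + 1));
        }
    }
}
```

For the four timestamps in the timeline, the gaps come out as 45030 ms, 17 ms and 877 ms: roughly 45 seconds pass between the channel closed event and the reconnect, and the last three events fall within about 0.9 seconds.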

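All of this happens within a second, so as far as I can tell no timeout is at work here. There was also no large GC or any measurable slowdown during or before this period.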

I'm at a loss: why would this happen? If it is a network problem, what should be tested on the network side?

1 Answer

Answering this myself with the actual cause of the behaviour:

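The TCP connection between two of the nodes was dropped, while both nodes kept their connections to the other nodes. This can be reproduced with a utility such as tcpkill.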
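
As one possible network-side check, here is a minimal Java sketch that tries to open a fresh TCP connection to a peer's transport port (8800 in the logs above) with a timeout. The class and method names (`TcpProbe`, `canConnect`) are hypothetical. It only shows whether a new connection can be opened at that moment; it will not detect an already established connection being killed, so it is at best a rough aid for correlating disconnects with reachability when run from each node against the others.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, not part of Elasticsearch.
class TcpProbe {
    // Returns true if a new TCP connection to host:port can be opened
    // within timeoutMillis, false on refusal, timeout or any other I/O error.
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Usage: java TcpProbe.java host1:8800 host2:8800 ...
        for (String target : args) {
            int colon = target.lastIndexOf(':');
            String host = target.substring(0, colon);
            int port = Integer.parseInt(target.substring(colon + 1));
            boolean ok = canConnect(host, port, 2000);
            System.out.println(System.currentTimeMillis() + " " + target
                    + (ok ? " reachable" : " NOT reachable"));
        }
    }
}
```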

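Unfortunately, Elasticsearch Zen discovery does not handle this kind of failure well, and all sorts of strange results can follow. A node that loses its connection to the master will hold an election and may confuse the other nodes.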

Answered 2013-09-26T12:52:41.703