
When I try to decommission a node in my Cassandra cluster, the process starts (I can see active streams flowing from the node to the other nodes in the cluster (it uses vnodes)), but then after a short delay nodetool decommission exits with the error message below.

I can re-run nodetool decommission and it will start streaming data to the other nodes again, but so far it has always ended with the same error.

Why am I seeing this, and is there a way I can safely decommission this node?

Exception in thread "main" java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
        at org.apache.cassandra.db.HintedHandOffManager.getHintsSlice(HintedHandOffManager.java:578)
        at org.apache.cassandra.db.HintedHandOffManager.listEndpointsPendingHints(HintedHandOffManager.java:528)
        at org.apache.cassandra.service.StorageService.streamHints(StorageService.java:2854)
        at org.apache.cassandra.service.StorageService.unbootstrap(StorageService.java:2834)
        at org.apache.cassandra.service.StorageService.decommission(StorageService.java:2795)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:93)
        at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:27)
        at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:208)
        at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:120)
        at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:262)
        at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:836)
        at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:761)
        at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1454)
        at javax.management.remote.rmi.RMIConnectionImpl.access$300(RMIConnectionImpl.java:74)
        at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1295)
        at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1387)
        at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:818)
        at sun.reflect.GeneratedMethodAccessor28.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:303)
        at sun.rmi.transport.Transport$1.run(Transport.java:159)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.rmi.transport.Transport.serviceCall(Transport.java:155)
        at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:535)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:790)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:649)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
        at org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:100)
        at org.apache.cassandra.service.StorageProxy.getRangeSlice(StorageProxy.java:1213)
        at org.apache.cassandra.db.HintedHandOffManager.getHintsSlice(HintedHandOffManager.java:573)
        ... 33 more
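
For reference, a decommission like this is typically kicked off and monitored with commands along the following lines (a minimal sketch; the host and JMX port are placeholders, not taken from the question):

    # Start the decommission on the node that is leaving the ring
    # (host/port are placeholders for that node's JMX endpoint)
    nodetool -h 127.0.0.1 -p 7199 decommission

    # Watch the outbound streams the decommission sets up
    nodetool -h 127.0.0.1 -p 7199 netstats

    # Check from another node whether the leaving node is still in the ring
    nodetool status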

2 Answers


The hinted handoff manager is checking for stored hints to see whether it needs to deliver them during the decommission, so that those hints aren't lost. Most likely you have a lot of hints, or a bunch of tombstones, or something else in that table causing the query to time out. You aren't seeing any other exceptions in the logs before the timeout, are you? Raising the read timeout on the node before decommissioning it, or manually deleting the hints column family, will most likely get you past this. If you delete the hints, make sure you run a full cluster repair once all of your decommissions are complete, so that the data covered by any deleted hints is propagated.
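
A rough sketch of those two workarounds, assuming a Cassandra version in which hints are stored in the system.hints table and the timeouts are configured in cassandra.yaml (the timeout values below are illustrative only):

    # Option 1: raise the read/range timeouts in cassandra.yaml on the node
    # being decommissioned, then restart it and retry the decommission:
    #   read_request_timeout_in_ms: 30000
    #   range_request_timeout_in_ms: 30000

    # Option 2: drop the stored hints before decommissioning. Newer nodetool
    # versions have a subcommand for this; otherwise truncate the table in cqlsh:
    nodetool truncatehints
    # cqlsh> TRUNCATE system.hints;

    # After all decommissions are done, run a full repair on each remaining
    # node so data covered only by the deleted hints is re-replicated:
    nodetool repair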

Answered 2014-05-13T03:39:45.930

The short answer is that the node I was trying to decommission was underpowered for the amount of data it held. As of this writing, there seems to be a fairly firm floor on the resources needed to handle a node with any appreciable amount of data, and that floor is roughly what an AWS i2.2xlarge provides. In particular, the older m1 instances get you into trouble by letting you store far more data on each node than the available memory and compute resources can support.
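
One way to sanity-check whether a node is in this situation is to compare its reported data load against the memory actually available on the machine, for example (a minimal sketch, not specific to the original cluster):

    # Data load and heap usage as seen by Cassandra on this node
    nodetool info

    # Per-node load and ownership across the whole cluster
    nodetool status

    # Memory actually available on the machine, for comparison
    free -m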

Answered 2014-04-14T16:56:57.583