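We put Cassandra 1.1.7 into production just today, but right before that we saw two Cassandra nodes go down with OOM errors. We had seen this error in load testing before and tuned nofile accordingly, so these errors should not be happening. Note also that the error occurs when there is no load on the infrastructure.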

ERROR [Thread-22] 2013-07-08 16:31:50,905 AbstractCassandraDaemon.java (line 135) Exception in thread Thread[Thread-22,5,main]
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:640)
at java.util.concurrent.ThreadPoolExecutor.addIfUnderCorePoolSize(ThreadPoolExecutor.java:703)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:652)
at org.apache.cassandra.net.MessagingService.receive(MessagingService.java:581)
at org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:155)
at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:113)

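We could not bring the Cassandra server back up (it kept giving the OOM error). Only after we shut down the OpsCenter (Enterprise 2.1.3) agent were we able to restart Cassandra, and then restart the agent. Below is the agent.log from around the time the Cassandra node died. We see a lot of "Thrift operation queue is full, discarding thrift operation" messages. We are also not using secondary indexes. Any ideas are welcome.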

WARN [pool-4-thread-1] 2013-07-08 16:31:41,395 Thrift operation queue is full, discarding thrift operation
WARN [pool-4-thread-1] 2013-07-08 16:31:41,396 367168 operations dropped so far.
WARN [pool-4-thread-1] 2013-07-08 16:31:41,396 Thrift operation queue is full, discarding thrift operation
WARN [pool-4-thread-1] 2013-07-08 16:31:41,396 367169 operations dropped so far.
WARN [pool-4-thread-1] 2013-07-08 16:31:41,396 Thrift operation queue is full, discarding thrift operation
WARN [pool-4-thread-1] 2013-07-08 16:31:41,396 367170 operations dropped so far.
WARN [pool-4-thread-1] 2013-07-08 16:31:41,397 Thrift operation queue is full, discarding thrift operation
WARN [pool-4-thread-1] 2013-07-08 16:31:41,397 367171 operations dropped so far.
WARN [pool-4-thread-1] 2013-07-08 16:31:41,397 Thrift operation queue is full, discarding thrift operation
WARN [pool-4-thread-1] 2013-07-08 16:31:41,397 367172 operations dropped so far.
WARN [pool-4-thread-1] 2013-07-08 16:31:41,398 Thrift operation queue is full, discarding thrift operation
WARN [pool-4-thread-1] 2013-07-08 16:31:41,398 367173 operations dropped so far.
WARN [pool-4-thread-1] 2013-07-08 16:31:41,398 Thrift operation queue is full, discarding thrift operation
WARN [pool-4-thread-1] 2013-07-08 16:31:41,398 367174 operations dropped so far.
WARN [pool-4-thread-1] 2013-07-08 16:31:41,398 Thrift operation queue is full, discarding thrift operation
WARN [pool-4-thread-1] 2013-07-08 16:31:41,398 367175 operations dropped so far.
WARN [pool-4-thread-1] 2013-07-08 16:31:41,399 Thrift operation queue is full, discarding thrift operation
WARN [pool-4-thread-1] 2013-07-08 16:31:41,399 367176 operations dropped so far.
WARN [pool-4-thread-1] 2013-07-08 16:31:41,399 Thrift operation queue is full, discarding thrift operation
WARN [pool-4-thread-1] 2013-07-08 16:31:41,399 367177 operations dropped so far.
WARN [pool-4-thread-1] 2013-07-08 16:31:41,399 Thrift operation queue is full, discarding thrift operation
WARN [pool-4-thread-1] 2013-07-08 16:31:41,400 367178 operations dropped so far.
WARN [pool-4-thread-1] 2013-07-08 16:31:41,400 Thrift operation queue is full, discarding thrift operation
WARN [pool-4-thread-1] 2013-07-08 16:31:41,400 367179 operations dropped so far.
WARN [pool-4-thread-1] 2013-07-08 16:31:41,400 Thrift operation queue is full, discarding thrift operation
WARN [pool-4-thread-1] 2013-07-08 16:31:41,400 367180 operations dropped so far.
WARN [pool-4-thread-1] 2013-07-08 16:31:41,401 Thrift operation queue is full, discarding thrift operation
WARN [pool-4-thread-1] 2013-07-08 16:31:41,401 367181 operations dropped so far.
WARN [pool-4-thread-1] 2013-07-08 16:31:41,401 Thrift operation queue is full, discarding thrift operation
WARN [pool-4-thread-1] 2013-07-08 16:31:41,401 367182 operations dropped so far.
WARN [pool-4-thread-1] 2013-07-08 16:31:41,402 Thrift operation queue is full, discarding thrift operation
WARN [pool-4-thread-1] 2013-07-08 16:31:41,402 367183 operations dropped so far.
WARN [pool-4-thread-1] 2013-07-08 16:31:41,402 Thrift operation queue is full, discarding thrift operation
WARN [pool-4-thread-1] 2013-07-08 16:31:41,402 367184 operations dropped so far.
WARN [pool-4-thread-1] 2013-07-08 16:31:41,402 Thrift operation queue is full, discarding thrift operation
WARN [pool-4-thread-1] 2013-07-08 16:31:41,403 367185 operations dropped so far.
WARN [pool-4-thread-1] 2013-07-08 16:31:41,403 Thrift operation queue is full, discarding thrift operation
WARN [pool-4-thread-1] 2013-07-08 16:31:41,403 367186 operations dropped so far.
WARN [pool-4-thread-1] 2013-07-08 16:31:41,403 Thrift operation queue is full, discarding thrift operation
WARN [pool-4-thread-1] 2013-07-08 16:31:41,403 367187 operations dropped so far.
WARN [pool-4-thread-1] 2013-07-08 16:31:41,404 Thrift operation queue is full, discarding thrift operation
WARN [pool-4-thread-1] 2013-07-08 16:31:41,404 367188 operations dropped so far.
WARN [pool-4-thread-1] 2013-07-08 16:31:41,404 Thrift operation queue is full, discarding thrift operation
WARN [pool-4-thread-1] 2013-07-08 16:31:41,404 367189 operations dropped so far.
WARN [pool-4-thread-1] 2013-07-08 16:31:41,404 Thrift operation queue is full, discarding thrift operation
WARN [pool-4-thread-1] 2013-07-08 16:31:41,405 367190 operations dropped so far.
ERROR [Thread-4] 2013-07-08 16:31:45,347 Error when proccessing thrift call me.prettyprint.hector.api.exceptions.HUnavailableException: : May not be enough replicas present to handle consistency level.
ERROR [pool-5-thread-1] 2013-07-08 16:31:47,793 Error connecting via JMX: java.io.IOException: Cannot run program "cat": java.io.IOException: error=11, Resource temporarily unavailable
ERROR [Thread-4] 2013-07-08 16:31:50,348 Error when proccessing thrift call me.prettyprint.hector.api.exceptions.HUnavailableException: : May not be enough replicas present to handle consistency level.
INFO [pool-5-thread-1] 2013-07-08 16:31:52,794 New JMX connection (127.0.0.1:7199)
ERROR [pool-5-thread-1] 2013-07-08 16:31:52,857 Error connecting via JMX: java.io.IOException: Failed to retrieve RMIServer stub: javax.naming.ServiceUnavailableException [Root exception is java.rmi.ConnectException: Connection refused to host: 127.0.0.1; nested exception is:
java.net.ConnectException: Connection refused]
WARN [pool-3-thread-4] 2013-07-08 16:31:53,127 Thrift operation queue is full, discarding thrift operation
WARN [pool-3-thread-4] 2013-07-08 16:31:53,128 367191 operations dropped so far.
WARN [pool-3-thread-4] 2013-07-08 16:31:53,128 Thrift operation queue is full, discarding thrift operation
WARN [pool-3-thread-4] 2013-07-08 16:31:53,128 367192 operations dropped so far.
WARN [pool-3-thread-4] 2013-07-08 16:31:53,128 Thrift operation queue is full, discarding thrift operation
WARN [pool-3-thread-4] 2013-07-08 16:31:53,129 367193 operations dropped so far.
WARN [pool-3-thread-4] 2013-07-08 16:31:53,129 Thrift operation queue is full, discarding thrift operation
WARN [pool-3-thread-4] 2013-07-08 16:31:53,129 367194 operations dropped so far.
WARN [pool-3-thread-4] 2013-07-08 16:31:53,129 Thrift operation queue is full, discarding thrift operation
WARN [pool-3-thread-4] 2013-07-08 16:31:53,129 367195 operations dropped so far.
WARN [pool-3-thread-4] 2013-07-08 16:31:53,130 Thrift operation queue is full, discarding thrift operation
WARN [pool-3-thread-4] 2013-07-08 16:31:53,130 367196 operations dropped so far.
ERROR [Thread-4] 2013-07-08 16:31:55,350 Could not flush transport (to be expected if the pool is shutting down) in close for client: CassandraClient<16.211.56.72:9160-3>
org.apache.thrift.transport.TTransportException: java.net.SocketException: Broken pipe
at org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:147)
at org.apache.thrift.transport.TFramedTransport.flush(TFramedTransport.java:156)
at me.prettyprint.cassandra.connection.client.HThriftClient.close(HThriftClient.java:98)
at me.prettyprint.cassandra.connection.client.HThriftClient.close(HThriftClient.java:26)
at me.prettyprint.cassandra.connection.HConnectionManager.closeClient(HConnectionManager.java:311)
at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:260)
at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecuteOperation(ExecutingKeyspace.java:97)
at me.prettyprint.cassandra.model.MutatorImpl.execute(MutatorImpl.java:243)
at clj_hector.core$put.doInvoke(core.clj:164)
at clojure.lang.RestFn.invoke(RestFn.java:470)
at opsagent.cassandra$store_rollup.invoke(cassandra.clj:107)
at clojure.lang.AFn.applyToHelper(AFn.java:161)
at clojure.lang.AFn.applyTo(AFn.java:151)
at clojure.core$apply.invoke(core.clj:540)
at opsagent.cassandra$async_call$fn__582$fn__583.invoke(cassandra.clj:164)
at opsagent.cassandra$process_queue$fn__587.invoke(cassandra.clj:170)
at opsagent.cassandra$process_queue.invoke(cassandra.clj:169)
at opsagent.cassandra$setup_cassandra$fn__595.invoke(cassandra.clj:203)
at clojure.lang.AFn.run(AFn.java:24)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.net.SocketException: Broken pipe
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
at org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:145)
... 19 more
ERROR [Thread-4] 2013-07-08 16:31:55,351 MARK HOST AS DOWN TRIGGERED for host 16.211.56.72(16.211.56.72):9160
ERROR [Thread-4] 2013-07-08 16:31:55,351 Pool state on shutdown: <ConcurrentCassandraClientPoolByHost>:{16.211.56.72(16.211.56.72):9160}; IsActive?: true; Active: 1; Blocked: 0; Idle: 0; NumBeforeExhausted: 0
INFO [Thread-4] 2013-07-08 16:31:55,351 Shutdown triggered on <ConcurrentCassandraClientPoolByHost>:{16.211.56.72(16.211.56.72):9160}
INFO [Thread-4] 2013-07-08 16:31:55,351 Shutdown complete on <ConcurrentCassandraClientPoolByHost>:{16.211.56.72(16.211.56.72):9160}
INFO [Thread-4] 2013-07-08 16:31:55,352 Host detected as down was added to retry queue: 16.211.56.72(16.211.56.72):9160
WARN [Thread-4] 2013-07-08 16:31:55,392 Could not fullfill request on this host CassandraClient<16.211.56.72:9160-3>

2 Answers

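"unable to create new native thread" is your smoking gun. Things that help include:

  • Increasing the kernel thread limit
  • Switching Thrift to hsha instead of thread-per-connection
  • Upgrading to the latest OpsCenter agent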
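
Before raising the thread limit, it can help to see how close a process is to it. The following is a minimal sketch, not part of the original answer. It assumes a Linux host (where the per-user RLIMIT_NPROC limit counts threads as well as processes) and that the script runs as the same user as Cassandra. The function names `nproc_limits`, `thread_count`, and `report` are hypothetical helpers introduced only for illustration.

```python
import os
import resource
from pathlib import Path


def nproc_limits():
    """Return the (soft, hard) RLIMIT_NPROC values for the calling user."""
    return resource.getrlimit(resource.RLIMIT_NPROC)


def thread_count(status_text):
    """Extract the 'Threads:' value from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split()[1])
    return None  # field not present (for example, on a non-Linux system)


def report(pid):
    """Summarise one process's thread count next to the nproc limits."""
    soft, hard = nproc_limits()
    status = Path(f"/proc/{pid}/status")
    threads = thread_count(status.read_text()) if status.exists() else None

    def fmt(value):
        return "unlimited" if value == resource.RLIM_INFINITY else str(value)

    return f"pid={pid} threads={threads} nproc soft={fmt(soft)} hard={fmt(hard)}"


if __name__ == "__main__":
    # Replace os.getpid() with the Cassandra or agent PID to inspect it.
    print(report(os.getpid()))
```

Note that the limit applies to all processes of the user combined, so a single process can hit it well below the reported value. The `error=11, Resource temporarily unavailable` line in the agent log would be consistent with this limit being reached, although the log alone does not prove it.
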
answered 2013-07-09T16:56:44.330

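I ran into this problem and went through a number of changes and experiments (plus a lot of reading, since some suggestions worked immediately on my Linux Ubuntu server but not on my development machine, which runs Mac OS). As most people suggest, I raised the OS limits "maxfiles", "maxfilesperproc", and so on, but the problem persisted on Mac OS. I finally got rid of it when (as jbellis suggested) I changed the Cassandra configuration (/conf/cassandra.yaml), namely the following parameters: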

...
native_transport_min_threads: 16
native_transport_max_threads: 128
...
rpc_server_type: hsha #changed from sync to hsha
...
rpc_min_threads: 16 
rpc_max_threads: 2048
...

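You will probably need to adjust these to your own setup (mainly the thread counts), but these variables appear to be what resolves the "java.lang.OutOfMemoryError: unable to create new native thread" problem.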
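
As a rough illustration of what to look for in the configuration, here is a small sketch that scans the text of a cassandra.yaml for the Thrift settings mentioned above. It is not a real YAML parser and only handles flat top-level `key: value` lines. The helper names `read_flat_settings` and `thrift_thread_warnings` are hypothetical, and the assumption that an unset `rpc_server_type` means `sync` should be checked against the defaults of your Cassandra version.

```python
def read_flat_settings(text):
    """Collect top-level 'key: value' pairs, ignoring comments and blanks."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].rstrip()  # drop trailing comments
        if not line or line[0].isspace() or ":" not in line:
            continue  # skip blanks, nested entries, and lines such as "..."
        key, value = line.split(":", 1)
        settings[key.strip()] = value.strip()
    return settings


def thrift_thread_warnings(settings):
    """Flag settings that let the Thrift server grow threads without bound."""
    warnings = []
    # Assumption: when the key is absent, the server type is "sync".
    server_type = settings.get("rpc_server_type", "sync")
    if server_type == "sync":
        warnings.append("rpc_server_type is sync: one thread per client connection")
        if "rpc_max_threads" not in settings:
            warnings.append("rpc_max_threads is not set: thread count is not capped")
    return warnings


if __name__ == "__main__":
    sample = """
rpc_server_type: hsha #changed from sync to hsha
rpc_min_threads: 16
rpc_max_threads: 2048
"""
    print(read_flat_settings(sample))
    print(thrift_thread_warnings(read_flat_settings(sample)))
```

With the sample above no warnings are produced, because the server type is hsha. A file containing only `rpc_server_type: sync` would produce both warnings.
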

answered 2013-07-27T21:23:23.577