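Very rarely, when it is under heavier load than usual, my RabbitMQ application starts throwing SocketException: Broken pipe (and basically stops processing any further messages).
The system uses the RPC pattern: workers listen for jobs on a few predefined queues, and clients submit jobs to those queues while opening a temporary auto-delete queue, which they specify as the replyTo queue and on which they listen for the reply (using a correlation ID as well to match messages).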
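To make the setup concrete, here is a minimal in-memory sketch of the request/reply matching described above. It is only a model, not my actual code: plain `BlockingQueue`s stand in for the RabbitMQ queues, and the names `RpcPatternSketch`, `Message`, `serveOne` and `call` are made up for illustration. No RabbitMQ client API is used.

```java
import java.util.UUID;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class RpcPatternSketch {

    /** A message carrying the two properties the RPC pattern relies on. */
    static final class Message {
        final String correlationId;
        final BlockingQueue<Message> replyTo; // stands in for the temporary reply queue
        final String body;

        Message(String correlationId, BlockingQueue<Message> replyTo, String body) {
            this.correlationId = correlationId;
            this.replyTo = replyTo;
            this.body = body;
        }
    }

    /** Worker side: take one job from the predefined queue and answer on its replyTo queue. */
    static void serveOne(BlockingQueue<Message> jobQueue) throws InterruptedException {
        Message job = jobQueue.take();
        // The reply echoes the correlation ID so the client can match it to its request.
        job.replyTo.put(new Message(job.correlationId, null, "done:" + job.body));
    }

    /** Client side: submit a job, then wait on a per-call reply queue for the matching reply. */
    static String call(BlockingQueue<Message> jobQueue, String body, long timeoutMs)
            throws InterruptedException {
        BlockingQueue<Message> replyQueue = new LinkedBlockingQueue<>(); // temporary, one per call
        String correlationId = UUID.randomUUID().toString();
        jobQueue.put(new Message(correlationId, replyQueue, body));

        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (true) {
            long remaining = deadline - System.nanoTime();
            if (remaining <= 0) {
                return null; // timed out
            }
            Message reply = replyQueue.poll(remaining, TimeUnit.NANOSECONDS);
            if (reply == null) {
                return null; // timed out while waiting
            }
            if (correlationId.equals(reply.correlationId)) {
                return reply.body;
            }
            // A reply with a different correlation ID is ignored; keep waiting.
        }
    }

    public static void main(String[] args) throws Exception {
        BlockingQueue<Message> jobs = new LinkedBlockingQueue<>();
        Thread worker = new Thread(() -> {
            try {
                serveOne(jobs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();
        System.out.println(call(jobs, "job-1", 1000));
        worker.join();
    }
}
```

The point of the model is that every call creates its own reply queue, so the number of temporary queues open at once grows with the number of calls still waiting on slow workers.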
The code that actually triggers the Broken pipe is quite simple. It is in the client part and looks basically like this:
factory = new ConnectionFactory();
factory.setUri(uri);
connection = factory.newConnection(); // this is when we get the exception
The exception is as follows:
2013-09-06 21:37:03,947 +0000 [http-bio-8080-exec-350] ERROR RabbitRpcClient:79 - IOException
java.net.SocketException: Broken pipe
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
at java.io.DataOutputStream.flush(DataOutputStream.java:123)
at com.rabbitmq.client.impl.SocketFrameHandler.flush(SocketFrameHandler.java:142)
at com.rabbitmq.client.impl.AMQConnection.flush(AMQConnection.java:488)
at com.rabbitmq.client.impl.AMQCommand.transmit(AMQCommand.java:125)
at com.rabbitmq.client.impl.AMQChannel.quiescingTransmit(AMQChannel.java:316)
at com.rabbitmq.client.impl.AMQChannel.transmit(AMQChannel.java:292)
at com.rabbitmq.client.impl.AMQChannel.transmit(AMQChannel.java:285)
at com.rabbitmq.client.impl.AMQConnection.start(AMQConnection.java:383)
at com.rabbitmq.client.ConnectionFactory.newConnection(ConnectionFactory.java:516)
at com.rabbitmq.client.ConnectionFactory.newConnection(ConnectionFactory.java:533)
...
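I think this usually coincides with the workers taking longer than usual to do their work, so that more temporary client queues are open at the same time (maybe about 20-30?). As far as I can tell, though, I am not hitting any of the usual watermarks (memory, disk), although I may be hitting some limit I don't know about.
I looked through the Rabbit logs, and the only error I found was: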
=ERROR REPORT==== 6-Sep-2013::21:36:59 ===
closing AMQP connection <0.3105.1297> (10.118.69.132:42582 -> 10.12.111.134:5672):
{handshake_timeout,frame_header}
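I checked both logs: the first "Broken pipe" on the client appears at 21:37:03, while the first ERROR of any kind in the RabbitMQ log for that date appears at 21:36:59, and errors of the same kind recur regularly after that until the system was restarted. So I believe the entry posted above is the corresponding log entry.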
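For anyone who wants to line up the two logs precisely, the following is a small sketch that parses both timestamp formats and prints the gap between them. It assumes the broker and the client log in the same time zone (the client log is +0000; I have not verified the broker's). The class name `LogTimestampGap` and the method `gap` are just illustrative.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class LogTimestampGap {

    // Client log timestamp, e.g. "2013-09-06 21:37:03,947"
    private static final DateTimeFormatter CLIENT_FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS", Locale.ENGLISH);

    // RabbitMQ error report timestamp, e.g. "6-Sep-2013::21:36:59"
    private static final DateTimeFormatter BROKER_FORMAT =
            DateTimeFormatter.ofPattern("d-MMM-yyyy::HH:mm:ss", Locale.ENGLISH);

    /** Time from the broker-side entry to the client-side entry (assumes the same time zone). */
    static Duration gap(String brokerTimestamp, String clientTimestamp) {
        LocalDateTime broker = LocalDateTime.parse(brokerTimestamp, BROKER_FORMAT);
        LocalDateTime client = LocalDateTime.parse(clientTimestamp, CLIENT_FORMAT);
        return Duration.between(broker, client);
    }

    public static void main(String[] args) {
        Duration d = gap("6-Sep-2013::21:36:59", "2013-09-06 21:37:03,947");
        System.out.println(d.toMillis() + " ms");
    }
}
```

With the two timestamps quoted above this gives 4947 ms, i.e. the broker closed the connection roughly five seconds before the client logged the Broken pipe.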
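I am using the Rabbit Java client 3.1.4 (the latest version on Maven Central) and Rabbit server 3.1.4 running on Amazon Linux on AWS EC2.
Here is the rabbitmqctl status under normal conditions (unfortunately not during a failure; I will try to capture that the next time it occurs):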
Status of node 'rabbit@ip-some-ip' ...
[{pid,2654},
{running_applications,
[{rabbitmq_management,"RabbitMQ Management Console","3.1.4"},
{rabbitmq_management_agent,"RabbitMQ Management Agent","3.1.4"},
{rabbit,"RabbitMQ","3.1.4"},
{os_mon,"CPO CXC 138 46","2.2.7"},
{rabbitmq_web_dispatch,"RabbitMQ Web Dispatcher","3.1.4"},
{webmachine,"webmachine","1.10.3-rmq3.1.4-gite9359c7"},
{mochiweb,"MochiMedia Web Server","2.7.0-rmq3.1.4-git680dba8"},
{xmerl,"XML parser","1.2.10"},
{inets,"INETS CXC 138 49","5.7.1"},
{mnesia,"MNESIA CXC 138 12","4.5"},
{amqp_client,"RabbitMQ AMQP Client","3.1.4"},
{sasl,"SASL CXC 138 11","2.1.10"},
{stdlib,"ERTS CXC 138 10","1.17.5"},
{kernel,"ERTS CXC 138 10","2.14.5"}]},
{os,{unix,linux}},
{erlang_version,
"Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:2:2] [rq:2] [async-threads:30] [kernel-poll:true]\n"},
{memory,
[{total,331967824},
{connection_procs,5389784},
{queue_procs,2669016},
{plugins,654768},
{other_proc,10063336},
{mnesia,90352},
{mgmt_db,2706344},
{msg_index,7148168},
{other_ets,3495648},
{binary,1952040},
{code,17696200},
{atom,1567425},
{other_system,278534743}]},
{vm_memory_high_watermark,0.4},
{vm_memory_limit,3126832332},
{disk_free_limit,1000000000},
{disk_free,1487147008},
{file_descriptors,
[{total_limit,349900},
{total_used,71},
{sockets_limit,314908},
{sockets_used,66}]},
{processes,[{limit,1048576},{used,930}]},
{run_queue,0},
{uptime,5680}]
...done.
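As a quick sanity check on the watermark claim, here is a small sketch that computes the headroom from the figures in the status output above. These are the normal-load numbers, not values from a failure, and the class and method names are only illustrative.

```java
public class WatermarkHeadroom {

    /** Fraction of vm_memory_limit currently in use; the memory alarm is raised above 1.0. */
    static double memoryUsedFraction(long memoryTotalBytes, long vmMemoryLimitBytes) {
        return (double) memoryTotalBytes / vmMemoryLimitBytes;
    }

    /** Free disk space above disk_free_limit; the disk alarm is raised once this drops below zero. */
    static long diskHeadroomBytes(long diskFreeBytes, long diskFreeLimitBytes) {
        return diskFreeBytes - diskFreeLimitBytes;
    }

    public static void main(String[] args) {
        // Values copied from the rabbitmqctl status output (normal conditions).
        long memoryTotal = 331967824L;
        long vmMemoryLimit = 3126832332L;
        long diskFree = 1487147008L;
        long diskFreeLimit = 1000000000L;

        System.out.printf("memory: %.1f%% of vm_memory_limit%n",
                100.0 * memoryUsedFraction(memoryTotal, vmMemoryLimit));
        System.out.printf("disk: %d bytes above disk_free_limit%n",
                diskHeadroomBytes(diskFree, diskFreeLimit));
    }
}
```

Memory is far below its limit in this snapshot, but free disk is only 487147008 bytes (under 500 MB) above disk_free_limit, so the disk figure is the one worth capturing during a failure.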
Any ideas what might be wrong, or at least what I can do to debug this and get a clearer picture of what is happening?