
Very rarely, when my RabbitMQ application is under heavier load than usual, it starts returning SocketException: Broken pipe (and essentially stops processing any further messages).

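The system uses the RPC pattern: workers listen for jobs on a few predefined queues, and clients submit tasks to those queues. For each task, the client also opens a temporary auto-delete queue, specifies it as the replyTo queue, and listens on it for the reply (using a correlation ID as well to match messages).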
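To make the reply matching concrete, here is a minimal sketch of the client-side bookkeeping, using only the JDK. `PendingReplies`, `register` and `complete` are hypothetical names for illustration; they are not part of the RabbitMQ client, and the real application code may look quite different.

```java
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical bookkeeping for outstanding RPC requests; not RabbitMQ API. */
final class PendingReplies {
    // Maps each outstanding correlation ID to the future awaiting its reply.
    private final ConcurrentMap<String, CompletableFuture<byte[]>> pending =
            new ConcurrentHashMap<>();

    /** Registers a request and returns a freshly generated correlation ID. */
    String register(CompletableFuture<byte[]> reply) {
        String correlationId = UUID.randomUUID().toString();
        pending.put(correlationId, reply);
        return correlationId;
    }

    /**
     * Intended to be called for each message received on the reply queue.
     * Returns false if the correlation ID is unknown or already answered.
     */
    boolean complete(String correlationId, byte[] body) {
        CompletableFuture<byte[]> reply = pending.remove(correlationId);
        if (reply == null) {
            return false;
        }
        reply.complete(body);
        return true;
    }
}
```

The client would send the ID returned by `register` as the message's correlation ID and pass each incoming reply's ID and body to `complete`.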

The code that actually triggers the Broken pipe is very simple. It is in the client part and basically looks like this:

factory = new ConnectionFactory();
factory.setUri(uri);
connection = factory.newConnection(); // this is when we get the exception

The exception is as follows:

2013-09-06 21:37:03,947 +0000 [http-bio-8080-exec-350] ERROR RabbitRpcClient:79  - IOException 
java.net.SocketException: Broken pipe
    at java.net.SocketOutputStream.socketWrite0(Native Method)
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
    at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
    at java.io.DataOutputStream.flush(DataOutputStream.java:123)
    at com.rabbitmq.client.impl.SocketFrameHandler.flush(SocketFrameHandler.java:142)
    at com.rabbitmq.client.impl.AMQConnection.flush(AMQConnection.java:488)
    at com.rabbitmq.client.impl.AMQCommand.transmit(AMQCommand.java:125)
    at com.rabbitmq.client.impl.AMQChannel.quiescingTransmit(AMQChannel.java:316)
    at com.rabbitmq.client.impl.AMQChannel.transmit(AMQChannel.java:292)
    at com.rabbitmq.client.impl.AMQChannel.transmit(AMQChannel.java:285)
    at com.rabbitmq.client.impl.AMQConnection.start(AMQConnection.java:383)
    at com.rabbitmq.client.ConnectionFactory.newConnection(ConnectionFactory.java:516)
    at com.rabbitmq.client.ConnectionFactory.newConnection(ConnectionFactory.java:533)     
    ...

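I believe this is generally associated with the workers taking longer than usual to do their work, so that more temporary client queues are open at the same time (maybe around 20-30?). As far as I can tell, though, I am not hitting any of the usual watermarks (memory, disk); I may be hitting some limit I am not aware of.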

I looked through the Rabbit logs, and the only error I found was:

=ERROR REPORT==== 6-Sep-2013::21:36:59 ===
closing AMQP connection <0.3105.1297> (10.118.69.132:42582 -> 10.12.111.134:5672):
{handshake_timeout,frame_header}

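I checked both logs: the first "Broken pipe" on the client appears at 21:37:03, while the first ERROR of any kind in the RabbitMQ log for that date appears at 21:36:59, and errors of the same kind recur regularly from then on until the system is restarted. So I believe the entry posted above is the corresponding log entry.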

I am using the Rabbit Java client 3.1.4 (the latest version on Maven Central) and Rabbit server 3.1.4, running on Amazon Linux on AWS EC2.

Here is the rabbitmqctl status under normal conditions (unfortunately not during a failure; I will try to capture it the next time it happens):

Status of node 'rabbit@ip-some-ip' ...
[{pid,2654},
 {running_applications,
     [{rabbitmq_management,"RabbitMQ Management Console","3.1.4"},
  {rabbitmq_management_agent,"RabbitMQ Management Agent","3.1.4"},
  {rabbit,"RabbitMQ","3.1.4"},
  {os_mon,"CPO  CXC 138 46","2.2.7"},
  {rabbitmq_web_dispatch,"RabbitMQ Web Dispatcher","3.1.4"},
  {webmachine,"webmachine","1.10.3-rmq3.1.4-gite9359c7"},
  {mochiweb,"MochiMedia Web Server","2.7.0-rmq3.1.4-git680dba8"},
  {xmerl,"XML parser","1.2.10"},
  {inets,"INETS  CXC 138 49","5.7.1"},
  {mnesia,"MNESIA  CXC 138 12","4.5"},
  {amqp_client,"RabbitMQ AMQP Client","3.1.4"},
  {sasl,"SASL  CXC 138 11","2.1.10"},
  {stdlib,"ERTS  CXC 138 10","1.17.5"},
  {kernel,"ERTS  CXC 138 10","2.14.5"}]},
 {os,{unix,linux}},
 {erlang_version,
 "Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:2:2] [rq:2]     [async-threads:30] [kernel-poll:true]\n"},
{memory,
 [{total,331967824},
  {connection_procs,5389784},
  {queue_procs,2669016},
  {plugins,654768},
  {other_proc,10063336},
  {mnesia,90352},
  {mgmt_db,2706344},
  {msg_index,7148168},
  {other_ets,3495648},
  {binary,1952040},
  {code,17696200},
  {atom,1567425},
  {other_system,278534743}]},
{vm_memory_high_watermark,0.4},
{vm_memory_limit,3126832332},
{disk_free_limit,1000000000},
{disk_free,1487147008},
{file_descriptors,
 [{total_limit,349900},
  {total_used,71},
  {sockets_limit,314908},
  {sockets_used,66}]},
{processes,[{limit,1048576},{used,930}]},
{run_queue,0},
 {uptime,5680}]
 ...done.

Any ideas what might be wrong, or at least what I could do to debug this and get a clearer picture of what is going on?


1 Answer


I have changed my code to reuse the Connection object, in fact even across multiple threads, and the problem seems not to recur (fingers crossed).
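For illustration, here is a rough sketch of that kind of reuse. It is not my exact code: `SharedResource` is a hypothetical helper written against the plain JDK so it stands alone, and I am assuming the opener passed in would be something like `() -> factory.newConnection()`.

```java
import java.util.concurrent.Callable;

/**
 * Hypothetical helper (not part of the RabbitMQ client): opens a resource
 * once, on first use, and hands the same instance to every caller.
 */
final class SharedResource<T> {
    private final Callable<T> opener; // assumed: () -> factory.newConnection()
    private T instance;               // guarded by "this"

    SharedResource(Callable<T> opener) {
        this.opener = opener;
    }

    /** Returns the shared instance, running the opener on the first call only. */
    synchronized T get() throws Exception {
        if (instance == null) {
            instance = opener.call();
        }
        return instance;
    }
}
```

Each request would then call `get().createChannel()` on the shared connection and close only that channel when it is done, rather than opening a new connection per request. As far as I understand the client documentation, a Connection may be shared between threads while a Channel should not be, so the sketch shares only the connection. It does not attempt to reconnect if the connection dies.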

Answered 2013-09-13T19:09:01.277