
I am running a Hadoop MapReduce job on my local machine (pseudo-distributed) that reads from and writes to HBase. I intermittently get an error that kills the job, even when the machine is otherwise idle with no other significant processes running (see the log below). After the job dies, the output of the ZooKeeper dump looks like the following, and the client count grows after each failed run:

HBase is rooted at /hbase
Master address: SS-WS-M102:60000
Region server holding ROOT: SS-WS-M102:60020
Region servers:
 SS-WS-M102:60020
Quorum Server Statistics:
 ss-ws-m102:2181
  Zookeeper version: 3.3.3-cdh3u0--1, built on 03/26/2011 00:20 GMT
  Clients:
   /192.168.40.120:58484[1](queued=0,recved=39199,sent=39203)
   /192.168.40.120:37129[1](queued=0,recved=162,sent=162)
   /192.168.40.120:58485[1](queued=0,recved=39282,sent=39316)
   /192.168.40.120:58488[1](queued=0,recved=39224,sent=39226)
   /192.168.40.120:58030[0](queued=0,recved=1,sent=0)
   /192.168.40.120:58486[1](queued=0,recved=39248,sent=39267)

My development team is currently on the CDH3U0 distribution, so HBase 0.90.1. Is this an issue that has been resolved in a more recent release? Or is there something I should change in my current setup? Should I just expect to restart ZK and kill off clients periodically? I am open to any reasonable option that lets my job complete consistently.

2012-06-27 13:01:07,289 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server SS-WS-M102/192.168.40.120:2181
2012-06-27 13:01:07,289 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to SS-WS-M102/192.168.40.120:2181, initiating session
2012-06-27 13:01:07,290 WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server SS-WS-M102/192.168.40.120:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcher.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202)
    at sun.nio.ch.IOUtil.read(IOUtil.java:169)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
    at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:858)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1130)
[lines above repeat 6 more times]
2012-06-27 13:01:17,890 ERROR org.apache.hadoop.hbase.mapreduce.TableInputFormat: org.apache.hadoop.hbase.ZooKeeperConnectionException: org.apache.hadoop.hbase.ZooKeeperConnectionException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getZooKeeperWatcher(HConnectionManager.java:991)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.setupZookeeperTrackers(HConnectionManager.java:302)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.<init>(HConnectionManager.java:293)
    at org.apache.hadoop.hbase.client.HConnectionManager.getConnection(HConnectionManager.java:156)
    at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:167)
    at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:145)
    at org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:91)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:605)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: org.apache.hadoop.hbase.ZooKeeperConnectionException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:147)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getZooKeeperWatcher(HConnectionManager.java:989)
    ... 15 more
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
    at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:637)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.createAndFailSilent(ZKUtil.java:902)
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:133)
    ... 16 more

3 Answers


It turned out I was hitting ZooKeeper's low default connection limit (which I believe has been raised in more recent releases). I had tried setting a higher limit in hbase-site.xml:

<property>
  <name>hbase.zookeeper.property.maxClientCnxns</name>
  <value>35</value>
</property>

But it did not seem to take effect unless it was (also?) specified in zoo.cfg:

# can put this number much higher if desired
maxClientCnxns=35

The job has now been running for hours, and my ZK client list peaks at 12 entries.
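If you want to watch the connection count yourself rather than re-running the full dump, ZooKeeper's four-letter commands can report the open client connections (the same data as the Quorum Server Statistics above). A sketch, assuming the quorum server runs on SS-WS-M102:2181 as in the logs and that `nc` is available:

```
# "cons" lists ZooKeeper's current client connections, one per line;
# each line corresponds to an entry in the Clients list from the dump.
echo cons | nc SS-WS-M102 2181

# "stat" reports the connection list plus node counts and latency figures.
echo stat | nc SS-WS-M102 2181
```

Watching this while the job runs makes it easy to see whether the count creeps up toward the maxClientCnxns limit.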

answered 2012-06-29T15:50:58.927

I have run into similar problems in the past. With HBase/Hadoop, the error message you see often does not point at the real problem, so you have to get creative.

Here is what I have found; it may or may not apply to you:

Are you opening many connections to the table, and are you closing them when you are done? This can happen in an MR job if you do Scans/Gets inside the Mapper or Reducer (which I do not think you want to do if you can avoid it).

Also, I sometimes hit similar problems when my Mapper or Reducer writes heavily to the same rows. Try distributing your writes, or minimizing them, to reduce the issue.

It would also help if you said more about the nature of your MR job. What does it do? Do you have sample code?
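For reference, one common pattern to avoid leaking connections from a task is to open the table once in setup() and close it in cleanup(), rather than once per record. A rough sketch against the HBase 0.90-era client API; the table name "mytable" and the class name are placeholders, not from the original post:

```
// Sketch only: one HTable per task attempt instead of one per map() call.
public class MyMapper extends TableMapper<ImmutableBytesWritable, Put> {
    private HTable lookupTable;

    @Override
    protected void setup(Context context) throws IOException {
        // Open the table once for the lifetime of the task attempt.
        lookupTable = new HTable(context.getConfiguration(), "mytable");
    }

    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
            throws IOException, InterruptedException {
        // Reuse the already-open table for any Gets needed per record.
        Result r = lookupTable.get(new Get(row.get()));
        // ... emit output based on r ...
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        // close() flushes buffered writes; the shared cluster connection
        // itself is managed by HConnectionManager.
        if (lookupTable != null) {
            lookupTable.close();
        }
    }
}
```

Constructing a new HTable inside map() instead would create connection churn on every record, which is exactly the kind of pressure that shows up in the ZK client list.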

answered 2012-06-28T21:35:27.130

Check the following parameters:

ZooKeeper session timeout (zookeeper.session.timeout) --> try increasing it and check

ZooKeeper tick time (tickTime) --> increase it and test

Check the ulimit settings (a Linux command) for the user running Hadoop/HBase.

For ulimit, the following parameters must have higher values:

open files: set this to 32K or more

max user processes: set it to unlimited

Once these changes are made, verify again; most likely the error will disappear.
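As a sketch, the first setting goes in hbase-site.xml (the value below is an illustrative starting point, not a recommendation from the answer):

```
<property>
  <name>zookeeper.session.timeout</name>
  <value>120000</value><!-- milliseconds; raise from the default and re-test -->
</property>
```

and the ulimit values are typically raised in /etc/security/limits.conf for the user that runs Hadoop/HBase ("hadoop" here is a placeholder username):

```
hadoop  -  nofile  32768
hadoop  -  nproc   unlimited
```

Note that limits.conf changes only apply to new login sessions, so the daemons must be restarted from a fresh shell for `ulimit -n` and `ulimit -u` to show the new values.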

answered 2013-10-03T11:49:44.640