hadoop - MapReduce input from an HTable

Question

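I have a MapReduce job whose input comes from an HTable. From the Java MapReduce code, how do I set the Job's input format to HBase's TableInputFormat?

Is there something like a JDBC connection that I can use to connect to an HTable database?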

score 1 · Accepted Answer

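If your client and HBase run on the same machine, you do not need to configure anything on the client for it to talk to HBase. Just create a configuration with HBaseConfiguration.create() and connect to your HTable: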

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "TABLE_NAME");

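If your client runs on a remote machine, however, it relies on ZooKeeper to reach your HBase cluster, so the client needs the location of the ZooKeeper quorum before it can proceed. This is how we usually configure clients so that they can connect to an HBase cluster: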

Configuration conf = HBaseConfiguration.create();
conf.set("hbase.zookeeper.quorum", "ZK_MACHINE_IP/HOSTNAME");
conf.set("hbase.zookeeper.property.clientPort","2181");
HTable table = new HTable(conf, "TABLE_NAME");
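
The quorum value above is a single placeholder. To my knowledge, hbase.zookeeper.quorum takes a comma-separated list of host names when the ZooKeeper ensemble has more than one server, while the port is set separately through hbase.zookeeper.property.clientPort. A minimal sketch of assembling that value in plain Java follows; the class name ZkQuorum and the host names are made up for illustration, and nothing in it calls HBase itself:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class ZkQuorum {
    // Builds the comma-separated host list used as the value of
    // hbase.zookeeper.quorum. Surrounding whitespace is trimmed and
    // blank entries are dropped.
    static String join(List<String> hosts) {
        return hosts.stream()
                .map(String::trim)
                .filter(h -> !h.isEmpty())
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        // Hypothetical ZooKeeper hosts; replace with your own ensemble.
        List<String> hosts = Arrays.asList(
                "zk1.example.com", " zk2.example.com ", "zk3.example.com");
        System.out.println(join(hosts));
    }
}
```

The resulting string is what you would pass as the second argument of conf.set("hbase.zookeeper.quorum", ...).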

That is how you do it through the Java API. HBase also supports several other APIs; you can find more about them in the HBase documentation.

As for your first question: if you need TableInputFormat as the InputFormat of your MR job, you can set it through the Job object, like this:

job.setInputFormatClass(TableInputFormat.class);

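Note that TableInputFormat also reads the name of the input table from the job configuration (the TableInputFormat.INPUT_TABLE key), so that has to be set as well.

Hope this answers your question.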

score 1 · Accepted Answer

HBase comes with a TableMapReduceUtil class that makes it easy to set up map/reduce jobs. This is the first example from the manual:

Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleRead");
job.setJarByClass(MyReadJob.class);     // class that contains mapper

Scan scan = new Scan();
scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);  // don't set to true for MR jobs
// set other scan attrs
// ...

TableMapReduceUtil.initTableMapperJob(
  tableName,        // input HBase table name
  scan,             // Scan instance to control CF and attribute selection
  MyMapper.class,   // mapper
  null,             // mapper output key
  null,             // mapper output value
  job);
job.setOutputFormatClass(NullOutputFormat.class);   // because we aren't emitting anything from mapper

boolean b = job.waitForCompletion(true);
if (!b) {
  throw new IOException("error with job!");
}
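
The comment on scan.setCaching(500) is easier to appreciate with numbers. Scanner caching is the number of rows the client fetches per round trip to a region server, so the number of round trips is roughly the row count divided by the caching value. Here is a back-of-the-envelope sketch in plain Java. The class name and the one-million-row table are assumptions for illustration, and since a real job runs one scanner per input split and other limits (such as result size) can apply, treat the figures as order-of-magnitude estimates only:

```java
final class ScanCachingEstimate {
    // Approximate number of scanner round trips needed to read `rows` rows
    // when each round trip returns up to `caching` rows (ceiling division).
    static long roundTrips(long rows, int caching) {
        if (rows < 0 || caching <= 0) {
            throw new IllegalArgumentException("rows must be >= 0 and caching > 0");
        }
        return (rows + caching - 1) / caching;
    }

    public static void main(String[] args) {
        long rows = 1_000_000L; // hypothetical table size
        System.out.println(roundTrips(rows, 1));   // prints 1000000
        System.out.println(roundTrips(rows, 500)); // prints 2000
    }
}
```

With a caching value of 1 every row costs its own round trip, which is why the manual's example raises it for MapReduce scans.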
