
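I have a MapReduce job that populates a search index in HBase. The job runs every day over the complete dataset. Is there a way to run MapReduce only on the new data that has arrived since the index was last computed, and then correctly update the search index in HBase?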


1 Answer


If your original data is stored in HBase, you can design the row key so that rows sort by time. You can then scan the table with a start row set to the row immediately after the last one you scanned yesterday. Alternatively, have the key start with the day: because rows are sorted by key, you can start at the first row of the desired day and stop at the first row of the next day.

If you create your rows as:

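import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

long currentTimeMS = System.currentTimeMillis();
// Days since the epoch: 1000 ms * 60 s * 60 min * 24 h (the long literal avoids int overflow)
long currentDay = currentTimeMS / (1000L * 60 * 60 * 24);
// Row key = 8-byte day prefix followed by the rest of the key
Put put = new Put(Bytes.add(Bytes.toBytes(currentDay), Bytes.toBytes("some other key stuff")));
// add columns...
hbaseTable.put(put);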

You can scan a day's worth of data with:

long currentTimeMS = System.currentTimeMillis();
long currentDay = currentTimeMS / (1000L * 60 * 60 * 24);
long yesterday = currentDay - 1;

// The start row is inclusive and the stop row is exclusive
Scan dayScan = new Scan();
dayScan.setStartRow(Bytes.toBytes(yesterday));
dayScan.setStopRow(Bytes.toBytes(currentDay));
// create map reduce job with dayScan
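
To see why a day prefix makes the start/stop rows work, here is a minimal sketch that runs without HBase. It assumes the day is encoded as 8 big-endian bytes (which is what `Bytes.toBytes(long)` produces) and that keys are compared as unsigned bytes in lexicographic order. The class `DayKeySketch` and its methods are hypothetical stand-ins that use only the JDK (Java 9+ for `Arrays.compareUnsigned`); they are not part of the HBase API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, JDK only: mimics day-prefixed row keys and a [start, stop) row range.
class DayKeySketch {
    static final long MS_PER_DAY = 24L * 60 * 60 * 1000;

    // Whole days since the Unix epoch (UTC) for a millisecond timestamp.
    static long dayBucket(long epochMillis) {
        return epochMillis / MS_PER_DAY;
    }

    // 8-byte big-endian day prefix followed by the UTF-8 bytes of the rest of the key.
    static byte[] rowKey(long day, String suffix) {
        byte[] rest = suffix.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(8 + rest.length).putLong(day).put(rest).array();
    }

    // True if key lies in [startRow, stopRow) under unsigned lexicographic byte order.
    static boolean inRange(byte[] key, byte[] startRow, byte[] stopRow) {
        return Arrays.compareUnsigned(key, startRow) >= 0
            && Arrays.compareUnsigned(key, stopRow) < 0;
    }
}
```

With `start = rowKey(5, "")` and `stop = rowKey(6, "")`, every key whose prefix is day 5 falls inside the range, whatever follows the prefix, while keys for day 4 or day 6 fall outside it. Note that this ordering argument only holds for non-negative day numbers.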

There are some libraries like Joda Time that make time calculations easier and the code more readable.
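
As an illustration of the readability point, here is a small sketch that uses the JDK's built-in `java.time` package (Java 8+) rather than Joda Time. This is a substitution on my part, not what the answer names. The class `DayMath` is hypothetical, and UTC is assumed as the day boundary.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;

// Hypothetical helper: day arithmetic with java.time instead of hand-written constants.
class DayMath {
    // Days since 1970-01-01 (UTC) for a millisecond timestamp.
    static long epochDay(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis)
                .atZone(ZoneOffset.UTC)
                .toLocalDate()
                .toEpochDay();
    }

    // Days since 1970-01-01 for a calendar date.
    static long epochDay(int year, int month, int dayOfMonth) {
        return LocalDate.of(year, month, dayOfMonth).toEpochDay();
    }

    // Epoch day of the date before the given calendar date.
    static long previousEpochDay(int year, int month, int dayOfMonth) {
        return LocalDate.of(year, month, dayOfMonth).minusDays(1).toEpochDay();
    }
}
```

For non-negative timestamps, `epochDay(ms)` gives the same value as `ms / (1000L * 60 * 60 * 24)`, so it can replace the manual division when building the day prefix.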

You can also try scan.setTimeRange() for a similar outcome. However, that assumes you insert and never update the source rows, because it operates on the update timestamp of the column versions. It may also be slower: rows are sorted by row key, not by timestamp, so the data for a given time range may be scattered across the table. Overall, this doesn't seem like the recommended way to go, but it works for quick-and-dirty prototyping.
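
A rough sketch of the bounds for such a time range, assuming UTC day boundaries. The class `TimeRangeSketch` is hypothetical and uses only the JDK. `Scan.setTimeRange(minStamp, maxStamp)` treats the minimum as inclusive and the maximum as exclusive, and the `contains` check below mirrors that.

```java
// Hypothetical helper: millisecond bounds for "yesterday" as a half-open range.
class TimeRangeSketch {
    static final long MS_PER_DAY = 24L * 60 * 60 * 1000;

    // Returns {minStamp, maxStamp} covering the UTC day before the one that contains nowMillis.
    static long[] yesterdayRange(long nowMillis) {
        long startOfToday = (nowMillis / MS_PER_DAY) * MS_PER_DAY;
        return new long[] { startOfToday - MS_PER_DAY, startOfToday };
    }

    // Half-open membership test: minStamp inclusive, maxStamp exclusive.
    static boolean contains(long[] range, long cellTimestamp) {
        return cellTimestamp >= range[0] && cellTimestamp < range[1];
    }
}
```

The two values would then be passed as `scan.setTimeRange(range[0], range[1])`. A row updated today drops out of yesterday's range, which is the insert-only caveat described above.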

If you are scanning data straight from HDFS, you can achieve something similar by saving each day's data to a different directory. You can then scan only yesterday's directory and nothing else.
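
One way to lay out such directories, as a sketch: the `baseDir/yyyy/MM/dd` layout, the class `DailyDirs`, and the base path in the usage note are all assumptions made for illustration, not something the answer prescribes.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: one directory per day under a base directory.
class DailyDirs {
    private static final DateTimeFormatter DAY_DIR = DateTimeFormatter.ofPattern("uuuu/MM/dd");

    // Directory that holds the data for the given calendar day.
    static String dirFor(String baseDir, LocalDate day) {
        return baseDir + "/" + day.format(DAY_DIR);
    }

    // Directory for the day before the given date, i.e. the only input of the daily job.
    static String yesterdayDir(String baseDir, int year, int month, int dayOfMonth) {
        return dirFor(baseDir, LocalDate.of(year, month, dayOfMonth).minusDays(1));
    }
}
```

For example, `yesterdayDir("/data/events", 2013, 3, 1)` resolves to `/data/events/2013/02/28`. The resulting string would be used as the job's single input path, for instance via `FileInputFormat.addInputPath(job, new Path(dir))`.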

Answered 2013-03-29T01:19:10.630