5

I am trying to compare two different tables in HBase so that I can automate the validation of some ETL processes that we use to move data in HBase. What's the best way to compare two tables in HBase?

My use case is below:

I want to create one table that holds my expected output: all of the data I expect the team's code to produce when it is run against an input file. I will then take the diff between the actual output table and the expected output table to verify the integrity of the component under test.

4

3 Answers

3

I don't know of anything out of the box, but you can write a multi-table map/reduce job.

The mappers just emit the row keys from each table, with the value being all of that row's HBase key-values plus the table name. The reducer can then check that it has two records for each key and compare their key-values. When a key has only one record, the table name in that record shows which table has the row and which one is out of sync.
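The comparison logic described above can be sketched in plain Python, without any HBase or Hadoop API. This is only an illustration of the map, shuffle, and reduce steps on in-memory dictionaries; the function names (`map_table`, `reduce_key`, `compare_tables`), the table labels `expected` and `actual`, and the sample rows are all hypothetical.

```python
from collections import defaultdict

def map_table(table_name, rows):
    """Mapper: emit (row_key, (table_name, cells)) for every row of one table."""
    for row_key, cells in rows.items():
        yield row_key, (table_name, cells)

def reduce_key(row_key, tagged_rows):
    """Reducer: classify one row key from the records the tables emitted for it."""
    by_table = dict(tagged_rows)
    if len(by_table) < 2:
        # Only one table emitted this key, so report which one has it.
        (only_table,) = by_table
        return row_key, "only_in_" + only_table
    expected, actual = by_table["expected"], by_table["actual"]
    return row_key, "match" if expected == actual else "mismatch"

def compare_tables(expected_rows, actual_rows):
    """Shuffle step: group mapper output by row key, then reduce each group."""
    grouped = defaultdict(list)
    for name, rows in (("expected", expected_rows), ("actual", actual_rows)):
        for row_key, tagged in map_table(name, rows):
            grouped[row_key].append(tagged)
    return dict(reduce_key(k, v) for k, v in grouped.items())

# Hypothetical sample data: row key -> {column: value}
expected = {"r1": {"cf:a": "1"}, "r2": {"cf:a": "2"}, "r3": {"cf:a": "3"}}
actual = {"r1": {"cf:a": "1"}, "r2": {"cf:a": "9"}, "r4": {"cf:a": "4"}}
report = compare_tables(expected, actual)
```

In a real job the two `rows` dictionaries would be replaced by scans over the two HBase tables, and the grouping would be done by the framework's shuffle rather than by a `defaultdict`.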

Answered 2013-09-19T16:40:20.963
2

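I know this question is a bit old, but how big are the tables? If they both fit in memory, you could load them into Pig using HBaseStorage and then use Pig's built-in DIFF function to compare the resulting bags.

According to the documentation, this works even for large tables that do not fit in memory, but it will be very slow.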
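To make clear what a DIFF over two bags reports, here is a small Python sketch of the idea: it returns the tuples that appear in one bag but not in the other, and an empty result when the bags match. It is not Pig code, the helper name `bag_diff` and the sample rows are hypothetical, and it assumes duplicates within a bag can be ignored by treating each bag as a set.

```python
def bag_diff(bag1, bag2):
    """Return the tuples that appear in one bag but not in the other."""
    # Assumption: each bag is treated as a set, so duplicate tuples collapse.
    set1, set2 = set(bag1), set(bag2)
    return sorted(set1 ^ set2)

# Hypothetical rows flattened to (row_key, value) tuples
expected_rows = [("r1", "1"), ("r2", "2"), ("r3", "3")]
actual_rows = [("r1", "1"), ("r2", "9")]

difference = bag_diff(expected_rows, actual_rows)
```

An empty result means the two tables hold the same rows; anything else lists the rows to investigate, although the result alone does not say which table each tuple came from.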

Answered 2014-01-07T19:50:07.687
0
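-- Load both datasets with the same schema (here from delimited files).
dataset1 = LOAD '/path/to/dataset1' USING PigStorage('<your delimiter>') AS (a:chararray, b:chararray, c:chararray, d:chararray);
dataset2 = LOAD '/path/to/dataset2' USING PigStorage('<your delimiter>') AS (a:chararray, b:chararray, c:chararray, d:chararray);

-- Group matching rows from both datasets on all four fields.
dataset3 = COGROUP dataset1 BY (a, b, c, d), dataset2 BY (a, b, c, d);

-- For each group, keep the tuples that appear in one bag but not the other.
dataset4 = FOREACH dataset3 GENERATE DIFF(dataset1, dataset2);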
Answered 2018-04-26T23:05:59.063