csv - HBase composite row key format

Question

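I'm trying to import some large .csv files into HBase (more than 1 TB in total). The data looks like a dump from a relational database, but without UIDs. I also don't want to import all of the columns. I've decided that I first need to run a custom MapReduce job to get the files into the required format (select columns + generate UIDs), so that I can then import them with the standard hbase importtsv bulk import.

My question: can I use MapReduce to create my own composite row key, such as storeID:year:UID, and then feed it to the tsv import? Say my data looks like this: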

row_key | price | quantity | item_id
A:2012:1|  0.99 |        1 |     001
A:2012:2|  0.99 |        2 |     012
B:2013:1|  0.99 |        1 |     004

As I understand it, HBase stores everything as bytes, except for timestamps. Will it understand that this is a composite key?

Any tips are appreciated!

score 0 · Accepted Answer

I asked the same question over at Cloudera, and the answer can be found here.

Basically, the answer is yes, and no separator characters are needed. I used a MapReduce job to transform the data to the following format:

A2012:1,0.99,1,001
A2012:2,0.99,2,012
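For illustration, here is a minimal sketch of that transform step, written as plain Python rather than as an actual MapReduce job. The column names (`store_id`, `year`, and so on), the helper names, and the per-(store, year) counter used as the UID are all assumptions for the example, not what my job actually did. In a real distributed job the UID has to be unique across tasks, which a per-process counter does not guarantee.

```python
import csv

def make_row_key(store_id, year, uid):
    """Build a composite key such as 'A2012:1' (store ID + year, then ':' + UID)."""
    return f"{store_id}{year}:{uid}"

def transform(lines, keep=("price", "quantity", "item_id")):
    """Yield one comma-separated output line per input record.

    Assumes (hypothetically) a header row containing store_id, year and
    the columns listed in `keep`. Columns not listed in `keep` are dropped.
    """
    reader = csv.DictReader(lines)
    counters = {}  # running UID per (store_id, year); only unique within this process
    for rec in reader:
        group = (rec["store_id"], rec["year"])
        counters[group] = counters.get(group, 0) + 1
        key = make_row_key(rec["store_id"], rec["year"], counters[group])
        yield ",".join([key] + [rec[col] for col in keep])
```

The first field of each output line is the row key, and the remaining fields are the selected columns, which is the shape shown above.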

Using importtsv and completebulkload, the data was then loaded into the correct HBase regions. I pre-split the table on the storeID (A, B, C, ...).
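To see why pre-splitting on the storeID works, note that HBase does not interpret the key as a composite at all. It sorts row keys as raw bytes, and each region owns a contiguous range of that order. The sketch below mimics the lookup with a binary search. The function name and the split points `B` and `C` are made up for the example, and this is not how the HBase client is actually implemented.

```python
from bisect import bisect_right
from typing import List

def region_index(row_key: bytes, split_points: List[bytes]) -> int:
    """Return the index of the region whose key range contains row_key.

    split_points must be sorted. Region 0 covers everything below the
    first split point; region i covers [split_points[i-1], split_points[i]).
    """
    return bisect_right(split_points, row_key)

# Hypothetical pre-split on store ID: regions are [.., B), [B, C), [C, ..)
splits = [b"B", b"C"]

for key in [b"A2012:1", b"A2012:2", b"B2013:1", b"C2012:7"]:
    print(key.decode(), "-> region", region_index(key, splits))
```

One consequence of the byte ordering: keys compare as text, so `A2012:10` sorts before `A2012:2`. If scan order within a store and year matters, zero-padding the UID to a fixed width avoids that.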
