
I have recently switched from an RDBMS to HBase for handling millions of records, but as a newbie I am not sure what the efficient way of designing an HBase schema is. The scenario: I have text files containing hundreds of thousands to millions of records that I have to read and store in HBase. There are two sets of text files (a RawData file and a Label file) that are linked to each other because they belong to the same user; for these files I have created two separate tables (RawData and Label) where I store their contents. The RawData file and the RawData table look like this:

[Screenshots: the RawData text file and the resulting RawData table]

As you can see, in my RawData table the row key is the name of the text file (01-01-All-Data.txt) followed by the row number of each line of the text file. The column family is just an arbitrary 'r', the column qualifiers are the columns of the text file, and the values are those columns' values. This is how I am inserting records into the table. I also have a third table (MapFile) where I store the name of the text file as the row key, the user's id as the column qualifier, and the total number of records in the text file as the value, which looks like this:

            01-01-All-Data.txt       column=m:1, timestamp=1375189274467, value=146209  
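
In code, the insert looks roughly like this (a minimal sketch using the classic pre-1.0 HBase Java client; the column names and values are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RawDataInsert {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable rawData = new HTable(conf, "RawData");
            HTable mapFile = new HTable(conf, "MapFile");

            // RawData: row key = file name + row number, family 'r',
            // one qualifier per column of the text file.
            Put row = new Put(Bytes.toBytes("01-01-All-Data.txt-1"));
            row.add(Bytes.toBytes("r"), Bytes.toBytes("col1"), Bytes.toBytes("value1"));
            row.add(Bytes.toBytes("r"), Bytes.toBytes("col2"), Bytes.toBytes("value2"));
            rawData.put(row);

            // MapFile: row key = file name, qualifier = user id,
            // value = total number of records in the file.
            Put meta = new Put(Bytes.toBytes("01-01-All-Data.txt"));
            meta.add(Bytes.toBytes("m"), Bytes.toBytes("1"), Bytes.toBytes("146209"));
            mapFile.put(meta);

            rawData.close();
            mapFile.close();
        }
    }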

I will use the MapFile table in order to read the RawData table row by row.
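
The read side looks roughly like this (again a sketch; it assumes the record count in MapFile is stored as a plain string, as in the scan output above):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RawDataReader {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable mapFile = new HTable(conf, "MapFile");
            HTable rawData = new HTable(conf, "RawData");

            // Look up how many rows the file produced.
            Result meta = mapFile.get(new Get(Bytes.toBytes("01-01-All-Data.txt")));
            long total = Long.parseLong(Bytes.toString(
                    meta.getValue(Bytes.toBytes("m"), Bytes.toBytes("1"))));

            // Fetch each RawData row by its composite key: file name + row number.
            for (long i = 1; i <= total; i++) {
                Result row = rawData.get(new Get(Bytes.toBytes("01-01-All-Data.txt-" + i)));
                // ... process the row ...
            }

            mapFile.close();
            rawData.close();
        }
    }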

What is your suggestion about this kind of HBase schema? Is it a proper approach, or does it not make sense in terms of HBase concepts?

Furthermore, it is worth mentioning that it takes around 3 minutes to insert a 21 MB file with 146,207 rows into HBase.

Please advise.

Thanks


1 Answer


While I don't see anything wrong with your current schema as such, whether it is suitable can only be decided after analyzing your use case and your frequent access patterns. IMHO, correct is not always suitable. Since I know nothing about either, my suggestions may sound off the mark; if that is the case, please let me know and I will update the answer accordingly. Here we go:

Does it make sense (keeping your data and access patterns in mind) to have just one table with three column families:

  • RD - for the RawData file, containing all of that file's columns,
  • LF - for the Label file, containing all of that file's columns, and
  • MF - for the MapFile, with one column holding the record count of the text file.

Use the user ID as the row key. It will be unique and won't look too verbose. With this design you also bypass the overhead of hopping from one table to another while fetching the data.
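
Created via the admin API, that layout would look something like this (a sketch against a 0.96-era client; the table name 'UserData' is just a placeholder):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class CreateUserData {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);

            // One table, keyed by user id, with the three column families.
            HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("UserData"));
            desc.addFamily(new HColumnDescriptor("RD")); // RawData file columns
            desc.addFamily(new HColumnDescriptor("LF")); // Label file columns
            desc.addFamily(new HColumnDescriptor("MF")); // record count
            admin.createTable(desc);

            admin.close();
        }
    }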

A few more suggestions:

  • If the user IDs are monotonically increasing, hash your row keys so that you don't suffer from RegionServer hotspotting (sketched below, after this list).
  • You could also create pre-split tables for a better distribution.
  • Keep column names short, if possible.
  • Keep the number of versions to a minimum.
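
A minimal sketch of the hashing idea (an MD5 prefix on the user id; the 4-byte salt width is an arbitrary choice):

    import java.security.MessageDigest;

    import org.apache.hadoop.hbase.util.Bytes;

    public class RowKeyUtil {
        // First 4 bytes of MD5(userId) + the user id itself: consecutive
        // ids now scatter across regions instead of all hitting the last one.
        public static byte[] saltedKey(String userId) throws Exception {
            byte[] hash = MessageDigest.getInstance("MD5").digest(Bytes.toBytes(userId));
            return Bytes.add(Bytes.head(hash, 4), Bytes.toBytes(userId));
        }
    }

With a fixed-width salt like this you can also pre-split the table over the salt range, e.g. admin.createTable(desc, new byte[] { 0x00 }, new byte[] { (byte) 0xff }, 16), so writes are spread across regions from the very first insert.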

Regarding the 3 minutes it takes to insert the 21 MB file with 146,207 rows:

How are you inserting the data: MapReduce, or the plain Java + HBase API? What is your cluster size, and what are its configuration and specs?


HTH

answered 2013-07-31T15:24:05.410