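I'm working on a project that involves monitoring a large number of RSS/Atom feeds. I want to use HBase for data storage, but I'm having trouble designing the schema. For the first iteration, I want to be able to generate an aggregated feed (the last 100 posts from all feeds, in reverse chronological order).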

Currently I'm using two tables:

Feeds: column families Content and Meta; the raw feed is stored in Content:raw
Urls: column families Content and Meta; the raw post is stored in Content:raw and the rest of the data found in the RSS is stored in Meta

I need some kind of index table for the aggregated feed. How should I build it? Is HBase a good choice for this kind of application?

Question update: is it possible (in HBase) to design a schema that can efficiently answer a query like the following?

SELECT data FROM Urls ORDER BY date DESC LIMIT 100

1 Answer

Peter Rietzler's answer on the hbase-user mailing list:

Hi

In our project we are handling event lists where we have similar requirements. We do ordering by choosing our row keys wisely. We use the following key for our events (they should be ordered by time in ascending order):

eventListName/yyyyMMddHHmmssSSS-000[-111]

where eventListName is the name of the event list, 000 is a three-digit instance id that disambiguates between different running instances of the application, and -111 is an optional suffix that disambiguates events that occurred in the same millisecond on one instance.
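
A minimal sketch of how such a key could be built, and why it sorts chronologically. It uses plain Python strings to stand in for HBase row keys, which HBase compares byte by byte. The function name `event_row_key`, the list name `feeds`, and the sample timestamps are illustrative assumptions, not part of Peter's mail.

```python
from datetime import datetime


def event_row_key(list_name, when, instance_id, seq=None):
    """Build a key shaped like eventListName/yyyyMMddHHmmssSSS-000[-111]."""
    # strftime has no millisecond directive, so derive it from microseconds.
    stamp = when.strftime("%Y%m%d%H%M%S") + f"{when.microsecond // 1000:03d}"
    key = f"{list_name}/{stamp}-{instance_id:03d}"
    if seq is not None:
        # Optional suffix for events in the same millisecond on one instance.
        key += f"-{seq:03d}"
    return key


# Hypothetical events, deliberately created out of order.
events = [
    event_row_key("feeds", datetime(2009, 8, 17, 8, 25, 56, 493000), 1),
    event_row_key("feeds", datetime(2009, 8, 16, 23, 59, 59, 999000), 2),
    event_row_key("feeds", datetime(2009, 8, 17, 8, 25, 56, 493000), 1, seq=1),
]

# Fixed-width, zero-padded fields make lexicographic order match time order.
ordered = sorted(events)
```

The fixed width of every field is what makes this work: without zero padding, a key for month 10 would sort before a key for month 9.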

We additionally insert an artificial row for each day with the id

eventListName/yyyyMMddHHmmssSSS

This allows us to start scanning at the beginning of each day without searching through the event list.
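
The effect of the daily marker row can be sketched with a sorted Python list standing in for the table and `bisect` standing in for a scan with a start row. This assumes the marker's time-of-day portion is filled with zeros, which the mail does not state; `day_marker` and `scan_from` are hypothetical helpers, not HBase API calls.

```python
import bisect

# A sorted list of row keys stands in for an HBase table.
rows = sorted([
    "feeds/20090816000000000",            # artificial marker row for Aug 16
    "feeds/20090816101500123-001",
    "feeds/20090816235959999-002",
    "feeds/20090817000000000",            # artificial marker row for Aug 17
    "feeds/20090817082556493-001",
    "feeds/20090817082556493-001-001",
])


def day_marker(list_name, yyyymmdd):
    """Key of the artificial row for a day (assumed: time part all zeros)."""
    return f"{list_name}/{yyyymmdd}000000000"


def scan_from(sorted_rows, start_key, limit):
    """Mimic a scan that starts at start_key and returns up to limit rows."""
    start = bisect.bisect_left(sorted_rows, start_key)
    return sorted_rows[start:start + limit]


result = scan_from(rows, day_marker("feeds", "20090817"), 3)
```

Because the marker sorts before every real event of its day, a scan that starts there reaches the day's first event without walking over earlier days.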

You need to be aware that under a very high insert load, one HBase region server is always busy inserting while the others are idle, because consecutive keys all land in the same region. If that's a problem for you, you have to find different keys for your purpose.
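
One commonly used way to find "different keys" is salting: prefixing each key with a small bucket number so that writes spread over several regions. This is not something Peter describes, only a sketch of the general idea under assumptions. `NUM_BUCKETS`, the `|` separator, and all function names are made up for illustration, and the price is that a time-ordered read must scan every bucket and merge the results.

```python
import heapq
import zlib

NUM_BUCKETS = 4  # hypothetical; would normally relate to the number of regions


def salted_key(row_key):
    """Prefix the key with a deterministic bucket number."""
    bucket = zlib.crc32(row_key.encode("utf-8")) % NUM_BUCKETS
    return f"{bucket:02d}|{row_key}"


def unsalt(key):
    """Strip the bucket prefix again."""
    return key.split("|", 1)[1]


def merged_scan(sorted_salted_rows):
    """Read every bucket in order and merge them back into time order."""
    per_bucket = [[] for _ in range(NUM_BUCKETS)]
    for key in sorted_salted_rows:
        bucket = int(key.split("|", 1)[0])
        per_bucket[bucket].append(unsalt(key))
    # Each bucket is already sorted, so a k-way merge restores global order.
    return list(heapq.merge(*per_bucket))


originals = [
    "feeds/20090816101500123-001",
    "feeds/20090816235959999-002",
    "feeds/20090817082556493-001",
    "feeds/20090817082556493-001-001",
]
table = sorted(salted_key(k) for k in originals)
restored = merged_scan(table)
```

The trade-off is between write distribution and read cost: more buckets spread inserts more evenly but multiply the number of scans needed for an ordered read.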

You could also use an HBase index table, but I have no experience with it. I remember an email on the mailing list saying that this would double all requests, because the API would first look up the index table and then the original table (please correct me if this is not right).

Kind regards, Peter

Thanks Peter.
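
Peter's keys sort in ascending time order, while the query in the question wants the newest posts first. A common variant, sketched here as an assumption rather than as anything from the mail, is to store a reversed timestamp (a fixed ceiling minus the post time) at the front of the index row key, so that a plain scan limited to 100 rows returns the latest 100 posts. `MAX_MILLIS`, `reversed_key`, and the example URLs are hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
MAX_MILLIS = 10**13 - 1  # assumed ceiling; keeps the field at 13 digits


def reversed_key(posted_at, url):
    """Index row key that sorts newest first: (ceiling - millis)/url."""
    # Integer division of timedeltas avoids float rounding of milliseconds.
    millis = (posted_at - EPOCH) // timedelta(milliseconds=1)
    return f"{MAX_MILLIS - millis:013d}/{url}"


posts = [
    (datetime(2009, 8, 16, 12, 0, tzinfo=timezone.utc), "http://a.example/old"),
    (datetime(2009, 8, 17, 8, 25, tzinfo=timezone.utc), "http://b.example/new"),
    (datetime(2009, 8, 17, 7, 0, tzinfo=timezone.utc), "http://c.example/mid"),
]

index_rows = sorted(reversed_key(when, url) for when, url in posts)

# Equivalent of ORDER BY date DESC LIMIT 100: take the first 100 rows.
latest_urls = [key.split("/", 1)[1] for key in index_rows[:100]]
```

Such an index table would be written alongside the Urls table on every insert, and it is subject to the same single-region insert load that Peter warns about.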

Answered 2009-08-17T08:25:56.493