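I'm considering using HBase to store logs (web log data). Each log will have about 20 different values (columns, say), and I want to run queries that filter results based on those columns.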

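My initial thought is to save each log multiple times, once under each column (cell) that holds the value of a field in the log. This would increase the data size about 20-fold, but I think it would serve performance well. The row key would be a timestamp prefixed with the source ID.
Each source will produce about 40-100M log rows (and there may be tens of thousands of sources).
I also need low latency, probably under 10 seconds (so solutions such as Hive are not an option at the moment).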
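
To make the row key idea concrete, here is a minimal sketch in plain Python (it does not call any HBase client). HBase orders rows lexicographically by the raw bytes of the row key, so a fixed-width big-endian encoding keeps all rows of one source contiguous and sorted by time. The 4-byte source ID, the 8-byte millisecond timestamp, and the function names `make_row_key` and `scan_bounds` are assumptions made for illustration, not part of the original design.

```python
import struct
from typing import Tuple


def make_row_key(source_id: int, ts_millis: int) -> bytes:
    """Build a 12-byte key: 4-byte source ID, then 8-byte timestamp.

    Both fields are unsigned and big-endian, so comparing the bytes
    gives the same order as comparing (source_id, ts_millis).
    """
    return struct.pack(">IQ", source_id, ts_millis)


def scan_bounds(source_id: int, start_millis: int, stop_millis: int) -> Tuple[bytes, bytes]:
    """Return (start key inclusive, stop key exclusive) for one source and time window."""
    return make_row_key(source_id, start_millis), make_row_key(source_id, stop_millis)


# Keys created out of order sort by source first, then by time.
keys = [make_row_key(7, 2000), make_row_key(3, 9000), make_row_key(7, 1000)]
sorted_keys = sorted(keys)

# A range scan for source 7 between t=1000 and t=2000 would cover only the first of its two rows.
start_key, stop_key = scan_bounds(7, 1000, 2000)
in_range = [k for k in sorted_keys if start_key <= k < stop_key]
```

With this layout, a query for one source and one time window maps to a single contiguous range of keys. Filtering on any of the other fields would still require reading every row in that range, which is the cost the 20-fold duplication is trying to avoid.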

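Do you think this is the right schema design? If not, what do you think is the right one, or should I use something else (and what)?
Thanks for all your answers.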

1 Answer

We're doing something similar with weblogs. Our case is slightly more complicated than the one you describe, but I can see similar issues coming up.

We created tables in Hive to store the various data we are collecting, then run a job that queries those tables and loads the results, pre-aggregated, into tables in HBase.

This limits data growth and duplication, because the raw data is stored only once and only the aggregations you want are stored on top of it. Keeping the raw data in Hive also makes it easier to aggregate by different dimensions and to manipulate the data in other ways.
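
As a rough illustration of the pre-aggregation step, here is a small sketch in plain Python. It stands in for the Hive query and the HBase load, and uses neither API. The field names (`source`, `ts`, `status`), the hourly bucket, and the key layout are assumptions made for the example, not a description of our actual job.

```python
from collections import Counter

# Raw log records, stored once (this list stands in for a Hive table).
raw_logs = [
    {"source": "src-1", "ts": 1303318800, "status": "200"},
    {"source": "src-1", "ts": 1303318860, "status": "200"},
    {"source": "src-1", "ts": 1303318900, "status": "404"},
    {"source": "src-1", "ts": 1303322400, "status": "200"},
    {"source": "src-2", "ts": 1303318800, "status": "500"},
]


def hour_bucket(ts_seconds: int) -> int:
    """Round a Unix timestamp (seconds) down to the start of its hour."""
    return ts_seconds - ts_seconds % 3600


def pre_aggregate(logs, dimension: str) -> Counter:
    """Count logs per (source, dimension value, hour).

    Each key is what a row key in the aggregate table could look like;
    each count is the single value stored under that key.
    """
    counts = Counter()
    for log in logs:
        key = f'{log["source"]}|{dimension}={log[dimension]}|{hour_bucket(log["ts"])}'
        counts[key] += 1
    return counts


# One aggregate "table" per dimension of interest.
agg = pre_aggregate(raw_logs, "status")
```

The aggregate table grows with the number of distinct (source, value, hour) combinations rather than with the number of raw log lines, and a new dimension only needs another pass over the raw data.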

Depending on your specific goals, HBase alone might be all you need for storage, but if the goal is to aggregate and analyze data, I think Hive and HBase would work better together.

If your results are not needed in real time, then just using Hive to store the raw data and generating reports from queries may also be an acceptable solution.

I am by no means a definitive resource on setups for the HStack, and I wasn't even a key member in the design of our existing system. I have, however, encountered a situation where we couldn't store data in HBase and retrieve it while keeping an optimal setup/organization for HBase: storing the data the way we needed for retrieval would have caused a lot of headaches in other areas.

I hope my ramblings have provided some help in some fashion. :)

Answered 2011-04-20T17:16:46.940