hadoop - need a solution for archiving logs and having real-time search functionality

Question

I've been considering the following options:

senseidb [http://www.senseidb.com] - This needs a fixed schema and also data gateways, so there is no simple way to push data; you have to provide data streams instead. My data is unstructured, and there are very few attributes common to all kinds of logs.
riak [http://wiki.basho.com/Riak-Search.html]
vertica - cost factor?
HBase (+ Hadoop ecosystem + Lucene) - the main cons here are that it won't make much sense on a single machine, and I am not sure about the free-text search capability that would have to be built around it.

Main requirements:

1. It has to sustain thousands of incoming requests for archival and, at the same time, build a real-time index that lets end users run free-text searches.

2. Storage (log archives + index) has to be optimal.

score 1 · Accepted Answer

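There are a number of dedicated log storage and indexing tools, so I'm not sure you need to shoehorn your logs into a general-purpose data store.

If you have a lot of money, it's hard to beat Splunk.

If you prefer an open-source option, check out the ServerFault discussion. logstash + ElasticSearch seems to be a really strong choice, and it should scale well as your logs grow.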
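
To make the logstash + ElasticSearch suggestion a bit more concrete, here is a minimal sketch of what indexing schema-free log lines can look like. It only builds the newline-delimited JSON body that the Elasticsearch `_bulk` endpoint accepts; in a real setup logstash would do the shipping for you. The index prefix `logs`, the `source` value, and the field names are assumptions made for illustration, and details can vary between Elasticsearch versions.

```python
import json
from datetime import datetime, timezone


def to_bulk_body(log_lines, index_prefix="logs", source="app", now=None):
    """Build an NDJSON body for the Elasticsearch _bulk endpoint.

    Each raw line becomes one document with a free-text `message` field,
    so the different log types do not need a common schema.
    """
    now = now or datetime.now(timezone.utc)
    # One index per day (hypothetical naming) keeps old data easy to drop.
    index_name = f"{index_prefix}-{now:%Y.%m.%d}"
    parts = []
    for line in log_lines:
        action = {"index": {"_index": index_name}}
        doc = {"@timestamp": now.isoformat(), "source": source, "message": line}
        parts.append(json.dumps(action))
        parts.append(json.dumps(doc))
    # The bulk API expects newline-delimited JSON that ends with a newline.
    return "\n".join(parts) + "\n"


body = to_bulk_body(["disk usage at 91%", "GET /index.html 200"])
```

The resulting `body` would be sent as an HTTP POST to `/_bulk` with the `Content-Type: application/x-ndjson` header. Because every document carries the raw line in `message`, a full-text query on that field covers the free-text search requirement.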

score 0 · Accepted Answer

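2-3 TB of data sounds like an "in-between" case. If that is all the data, I would not suggest going on a BigData / NoSQL adventure.
I think an RDBMS with full-text search capability should do the job on good hardware. I would suggest some aggressive partitioning by time to be able to handle 2-3 TB of data. Without partitioning it would be too much. On the other hand, if your data is partitioned by day, I think the data size will be fine for MySQL.
Taking into account the comments below, the data size is about 10-15 TB, and considering the need for some replication, that number should be multiplied by 2-3. We should also account for the size of the indexes, which I estimate at tens of percent of the data size. A single-node solution that can handle this may well be more expensive than a cluster, mainly because of licensing costs.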
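
The sizing estimate above can be worked through as a quick calculation. This is only a rough sketch: the 20% and 50% index overheads are my own stand-ins for "tens of percent", and it assumes the index is replicated along with the data.

```python
def storage_estimate_tb(raw_tb, replication, index_fraction):
    """Total storage: (raw data + index) multiplied by the replication factor."""
    return raw_tb * (1 + index_fraction) * replication


# Low end: 10 TB raw, 2 copies, index at 20% of the data (assumed).
low = storage_estimate_tb(10, 2, 0.2)
# High end: 15 TB raw, 3 copies, index at 50% of the data (assumed).
high = storage_estimate_tb(15, 3, 0.5)
print(f"roughly {low:.0f} TB to {high:.1f} TB")
```

Under these assumptions the total lands somewhere between about 24 TB and 67.5 TB, which is well beyond the original 2-3 TB figure.
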
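As far as I understand, the existing Hadoop/NoSQL solutions will not meet your requirements out of the box, mainly because of the number of documents to be indexed if every log entry is a document. (http://blog.mgm-tp.com/2010/06/hadoop-log-management-part3/)
So I think the solution would be to aggregate the logs over some period of time and treat each aggregate as a single document.
For storing these log packages, HDFS or Swift might be a good solution.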
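
As a rough illustration of the aggregation idea, the sketch below groups log entries into one document per hour. The one-hour window, the function name `bundle_by_hour`, and the sample entries are all assumptions chosen for the example, not something prescribed by any particular tool.

```python
from collections import defaultdict
from datetime import datetime


def bundle_by_hour(entries):
    """Group (timestamp, line) pairs into one text document per hour.

    Returns a dict mapping an hour key such as '2012-01-02T13' to the
    newline-joined log lines that fall within that hour.
    """
    buckets = defaultdict(list)
    for ts, line in entries:
        buckets[ts.strftime("%Y-%m-%dT%H")].append(line)
    return {hour: "\n".join(lines) for hour, lines in buckets.items()}


# Made-up sample entries, for illustration only.
entries = [
    (datetime(2012, 1, 2, 13, 5), "GET /index.html 200"),
    (datetime(2012, 1, 2, 13, 40), "disk usage at 91%"),
    (datetime(2012, 1, 2, 14, 1), "connection reset by peer"),
]
docs = bundle_by_hour(entries)
# Three log lines become two documents: one for 13:00 and one for 14:00.
```

The trade-off is that a search hit then points at an hourly bundle rather than a single log line, so the matching line still has to be located inside the bundle.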

score 0 · Accepted Answer

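Have you thought about going the route of these implementations? Integrating Lucene and Hadoop might help with your problem.

http://www.cloudera.com/blog/2011/09/hadoop-for-archiving-email/
http://www.cloudera.com/blog/2012/01/hadoop-for-archiving-email-part-2/

For your use case, you would index log files and their parameters instead of emails.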
