hadoop - Is Hadoop the right tech for this?

Question

If I had millions of records that are constantly being updated and added to every day, and I needed to comb through all of the data for records that match specific logic, then take that matching subset and insert it into a separate database, would I use Hadoop and MapReduce for such a task, or is there some other technology I am missing? The main reason I am looking for something other than a standard RDBMS is that all of the base data comes from multiple sources and is not uniformly structured.

score 2 · Accepted Answer

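Map-Reduce is designed for algorithms that can be parallelized, where local results can be computed and then aggregated. A typical example is counting the words in a document. You can split the document into multiple parts, count some of the words on one node, some on another node, and so on, and then add up the totals (obviously this is a trivial example, but it illustrates the type of problem).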
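The word-counting example can be sketched in a few lines of plain Python. This is only an illustration of the split / count locally / add up pattern, not Hadoop's API: the function names (`split_into_chunks`, `map_count`, `reduce_counts`, `word_count`) are made up for this sketch, and the "nodes" are simulated by a list of chunks processed in a loop.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(words, n_chunks):
    """Split a list of words into roughly equal parts, one per simulated node."""
    size = max(1, -(-len(words) // n_chunks))  # ceiling division, at least 1
    return [words[i:i + size] for i in range(0, len(words), size)]


def map_count(chunk):
    """Map step: count the words in one chunk (the local result)."""
    return Counter(chunk)


def reduce_counts(left, right):
    """Reduce step: merge two local results by adding their counts."""
    return left + right


def word_count(text, n_chunks=3):
    """Count words by splitting the text, counting per chunk, then summing."""
    words = text.lower().split()
    chunks = split_into_chunks(words, n_chunks)
    # On a real cluster each chunk would be counted on a different node.
    local_counts = [map_count(chunk) for chunk in chunks]
    return reduce(reduce_counts, local_counts, Counter())


if __name__ == "__main__":
    print(word_count("the quick fox jumps over the lazy dog the end"))
```

The point of the pattern is that `map_count` never needs to see any chunk other than its own, which is what makes the work parallelizable.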

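Hadoop is designed for processing large data files (such as log files). The default block size is 64MB, so having millions of small records would not really be a good fit for Hadoop.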

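To deal with the problem of non-uniformly structured data, you could consider a NoSQL database that is designed to handle data where a lot of columns are null (such as MongoDB).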

score 1 · Accepted Answer

Hadoop/MR is designed for batch processing, not real-time processing. So some other alternative, such as Twitter Storm or HStreaming, has to be considered.

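Also, take a look at Hama for real-time processing of data. Note that real-time processing in Hama is still crude and needs a lot of improvement/work.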

score 1 · Accepted Answer

I would recommend Storm or Flume. In either of them you can analyze each record as it comes in and decide what to do with it.
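The per-record idea might look roughly like the sketch below. It is plain Python, not the Storm or Flume API: `matches` stands in for whatever "specific logic" the question has in mind (the `level` and `user_id` fields are invented for illustration), and `sink` is a list standing in for the separate database.

```python
def matches(record):
    """Hypothetical matching logic: keep error events that carry a user id."""
    return record.get("level") == "error" and "user_id" in record


def process_stream(records, sink):
    """Inspect each record as it arrives and forward the matches to the sink.

    `records` can be any iterable, including a generator that yields
    records as they are received. Returns the number of forwarded records.
    """
    forwarded = 0
    for record in records:
        if matches(record):
            sink.append(record)  # a real sink would insert into the target database
            forwarded += 1
    return forwarded


if __name__ == "__main__":
    incoming = [
        {"level": "info", "user_id": 1},
        {"level": "error", "user_id": 2},
        {"level": "error"},  # differently structured record: no user_id
    ]
    target = []
    print(process_stream(incoming, target), target)
```

Because each record is handled on its own, records from different sources do not need to share a structure; the matching function just has to tolerate missing fields.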

score 0 · Accepted Answer

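If your data volume is not very large, and millions of records does not sound like it is, I would suggest trying to get the most out of an RDBMS, even if your schema will not be properly normalized. I think even a table with the structure K1, K2, K3, Blob would be more useful.

Among NoSQL systems, key-value stores are built to support schemaless data in various flavors, but their query capability is limited. The only case where I think they would be useful is the ability of MongoDB/CouchDB to index schemaless data. You would be able to fetch records by some attribute value.

Regarding Hadoop MapReduce: I do not think it is useful unless you want to harness a lot of CPUs for your processing, have a lot of data, or need distributed sort capability.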
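A minimal sketch of the K1, K2, K3, Blob layout, using Python's built-in `sqlite3` module as a stand-in for whatever RDBMS is actually in use. The table name `records`, the column name `payload`, the choice of JSON for the blob, and the helper names are all assumptions made for this example.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Create the table: three indexed key columns plus an opaque payload."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records ("
        " k1 TEXT NOT NULL, k2 TEXT, k3 TEXT, payload TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_records_keys ON records (k1, k2, k3)"
    )
    return conn


def put(conn, k1, k2, k3, payload):
    """Store one record; the non-uniform part is serialized as JSON."""
    conn.execute(
        "INSERT INTO records (k1, k2, k3, payload) VALUES (?, ?, ?, ?)",
        (k1, k2, k3, json.dumps(payload)),
    )


def find_by_k1(conn, k1):
    """Fetch the payloads of all records with the given first key."""
    rows = conn.execute(
        "SELECT payload FROM records WHERE k1 = ? ORDER BY rowid", (k1,)
    )
    return [json.loads(row[0]) for row in rows]


if __name__ == "__main__":
    conn = open_store()
    put(conn, "source_a", "2012", None, {"name": "x", "tags": ["p", "q"]})
    put(conn, "source_b", None, None, {"title": "y"})
    print(find_by_k1(conn, "source_a"))
```

Only the key columns can be queried efficiently here; anything buried in the payload has to be filtered in application code, which is the trade-off this layout accepts.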
