hadoop - What approximate amount of semistructured data is enough for setting up a Hadoop cluster?

Question

I know Hadoop is not the only option for processing semistructured data in general: I can do a lot with plain tab-separated data, a handful of Unix tools (cut, grep, sed, ...) and hand-written Python scripts. But sometimes I get really large amounts of data, and processing time goes up to 20-30 minutes. That is unacceptable to me, because I want to experiment with the dataset interactively, running semi-ad-hoc queries and so on.

So, how much data do you consider enough to justify setting up a Hadoop cluster, in terms of cost versus results?

score 2 · Accepted Answer

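Without knowing exactly what you're doing, here are my suggestions:

If you want to run ad-hoc queries on the data, Hadoop is not the best way to go. Have you tried loading your data into a database and running queries on that?
If you want to experiment with Hadoop without the cost of setting up a cluster, try Amazon's Elastic MapReduce offering: http://aws.amazon.com/elasticmapreduce/
I've personally seen people get pretty far using shell scripts for these kinds of tasks. Have you tried distributing your work across machines using SSH? GNU Parallel makes this pretty easy: http://www.gnu.org/software/parallel/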
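
To make the last suggestion concrete, here is a minimal sketch of the same split-and-combine idea on a single machine, in Python rather than GNU Parallel itself. It assumes the data has already been split into several tab-separated chunk files; the function names, the column index and the `executor_cls` parameter are all hypothetical, chosen only for illustration.

```python
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures import ThreadPoolExecutor  # optional drop-in for quick experiments
from functools import partial


def count_matching_rows(path, column, value):
    """Count rows in one tab-separated file whose `column` equals `value`."""
    matches = 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            # Skip short or malformed rows instead of failing on them.
            if len(fields) > column and fields[column] == value:
                matches += 1
    return matches


def parallel_count(paths, column, value, workers=4,
                   executor_cls=ProcessPoolExecutor):
    """Run count_matching_rows over every chunk file and sum the results.

    Each chunk is handled by a separate worker, which is roughly what
    `cut | grep | wc -l` per chunk followed by a final sum would do.
    """
    job = partial(count_matching_rows, column=column, value=value)
    with executor_cls(max_workers=workers) as pool:
        return sum(pool.map(job, paths))
```

A call such as `parallel_count(["part-0.tsv", "part-1.tsv"], column=2, value="404")` (file names made up) would return the total across chunks. When run as a script with the default process pool, the call should sit under an `if __name__ == "__main__":` guard. This only uses the cores of one machine; spreading chunks over several hosts via SSH is the part GNU Parallel adds.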

score 1 · Accepted Answer

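I think there are several aspects to this question. The first is what you can achieve with common SQL technologies such as MySQL or Oracle. If you can get a solution with them, I think that will be the better solution.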

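It should also be pointed out that Hadoop processing of tabular data will be much slower than a conventional DBMS. That brings me to the second aspect: are you ready to build a Hadoop cluster with more than 4 machines? I think 4-6 machines is the minimum needed to see some gain.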

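The third aspect: are you ready to wait for the data to be loaded into a database? That can take time, but the queries will then be fast. So if you run only a few queries per dataset, that works in Hadoop's favor.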
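
The load-once, query-many trade-off can be tried out cheaply before committing to a database server. The sketch below uses Python's built-in `sqlite3` module as a stand-in for MySQL or Oracle; the `load_tsv` helper and the table and column names are hypothetical, and it assumes a plain tab-separated file without quoting.

```python
import csv
import sqlite3


def load_tsv(conn, tsv_path, table, columns):
    """Bulk-load a tab-separated file into a new SQLite table.

    Returns the number of rows stored. Rows with the wrong number of
    fields are skipped. Values are stored as text, so numeric
    comparisons in later queries need an explicit CAST.
    """
    column_list = ", ".join('"%s"' % name for name in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute('CREATE TABLE "%s" (%s)' % (table, column_list))
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        rows = (row for row in reader if len(row) == len(columns))
        with conn:  # one transaction for the whole load
            conn.executemany(
                'INSERT INTO "%s" VALUES (%s)' % (table, placeholders), rows
            )
    return conn.execute('SELECT COUNT(*) FROM "%s"' % table).fetchone()[0]
```

After a one-time `load_tsv(conn, "hits.tsv", "hits", ["user", "status"])` (names made up), each ad-hoc question becomes a single `conn.execute("SELECT ...")` call, and adding an index on the columns you filter by is what typically makes repeated queries fast. The loading step is the up-front cost the answer refers to.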

Back to the original question: I think you need at least 100-200 GB of data for Hadoop processing to make sense. I think 2 TB is a clear indication that Hadoop might be a good choice.
