
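Online I see many examples of the canonical word-count map reduce. I understand that the mapper takes (k, v) as input and the reducer receives (k, list(v)), with map reduce doing some magic in between. What I don't quite understand is how to apply mapreduce to more practical examples. For example: suppose I have a file containing the salaries of all employees in the US, along with some other details such as state and city. How would mapreduce work to produce an output report with the following summarized columns? state, city, avg(salary)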

In SQL, I can get it with a query like this:

Select state, city, avg(salaries) 
From employee_tbl
Group by state, city

How would map reduce give me the above result set? I have used Hive, but I don't know how to translate SQL into map and reduce.


2 Answers


A simple way to turn a SQL query into a map-reduce job is to use Hive on Hadoop.

But if you don't want to do that, there is a simple rule of thumb that applies in most cases when emulating a SQL query as a map-reduce job:

Key-out of the Map function is the set of columns in the group by clause.

In your example, make state-city the key and emit it from the Map function (with some delimiter between the two columns).

Value-out of the Map function is the column on which the aggregate function is to be run.

In your example, that is the individual salary (if more than one column is to be aggregated, you can separate them with the same delimiter).
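A minimal Python sketch of the map side described so far. It is not Hadoop code: the record layout (`name,state,city,salary`), the tab delimiter, and the function name `map_record` are assumptions made for illustration, since the question does not specify the file format.

```python
from typing import Iterator, Tuple

DELIM = "\t"  # assumed delimiter between the group-by columns in the key


def map_record(line: str) -> Iterator[Tuple[str, float]]:
    """Emit (state<DELIM>city, salary) for one input line.

    Assumes a hypothetical CSV layout: name,state,city,salary.
    """
    _name, state, city, salary = line.strip().split(",")
    # Key-out: the group-by columns joined by a delimiter.
    # Value-out: the column the aggregate will run over.
    yield f"{state}{DELIM}{city}", float(salary)


pairs = list(map_record("Alice,CA,San Diego,100000"))
print(pairs)
```

Every row with the same state and city produces the same key, which is what lets the framework route all of those salaries to a single reduce call.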

Key-in of the Reduce function will be the same as key-out of the Map function.

Value-out of the Reduce function will be the result of running the aggregate function over the Map value-out of all rows that share the same key.

So in this case you sum up all the value-in salaries and count them. Since the query asks for avg, value-out is that sum divided by the count, i.e. the average salary for each unique 'state-city' pair.
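To make the rule of thumb concrete, here is a small in-memory Python simulation of the map, shuffle, and reduce phases for the group-by query in the question. It is only a sketch, not Hadoop code: the record layout (`name,state,city,salary`), the function names, and the sample rows are all invented for illustration. Because the query asks for avg, the reducer divides the sum of salaries by their count.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

DELIM = "\t"  # assumed delimiter between state and city in the key


def map_record(line: str) -> Iterator[Tuple[str, float]]:
    """Map: key-out = group-by columns, value-out = column to aggregate."""
    _name, state, city, salary = line.strip().split(",")
    yield f"{state}{DELIM}{city}", float(salary)


def shuffle(pairs: Iterable[Tuple[str, float]]) -> Dict[str, List[float]]:
    """Stand-in for the framework's shuffle: group values by key."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_group(key: str, salaries: List[float]) -> Tuple[str, str, float]:
    """Reduce: run the aggregate (here avg) over all values for one key."""
    state, city = key.split(DELIM)
    return state, city, sum(salaries) / len(salaries)


def run_job(lines: Iterable[str]) -> List[Tuple[str, str, float]]:
    pairs = (pair for line in lines for pair in map_record(line))
    return [reduce_group(k, v) for k, v in sorted(shuffle(pairs).items())]


# Made-up sample rows.
rows = [
    "Alice,CA,San Diego,100000",
    "Bob,CA,San Diego,80000",
    "Carol,NY,Albany,70000",
]
report = run_job(rows)
for state, city, avg_salary in report:
    print(state, city, avg_salary)
# CA San Diego 90000.0
# NY Albany 70000.0
```

One caveat if a combiner is added to cut shuffle traffic: an average of partial averages is generally not the overall average, so the mapper or combiner would need to emit (sum, count) pairs and leave the division to the reducer.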

Answered 2013-01-31T07:00:18.550

If you want to directly translate a SQL query to a set of Map/Reduce jobs, you should definitely take a look at YSmart. It is a SQL-to-Map/Reduce translator built on top of Hadoop. Some studies have reportedly shown it may be faster than Hive, although I can't back this claim up as I haven't tested it myself.

As taken from their docs, YSmart provides:

  • High Performance: The MapReduce programs generated by YSmart are optimized. YSmart can automatically detect and utilize intra-query correlations when translating a query. This correlation-aware ability significantly reduces redundant computation, unnecessary disk IO operations and network overhead. See the Performance page to learn the performance benefits of YSmart.

  • High Extensibility: YSmart is easy to modify and extend. It is designed with the goal of extensibility. The major part of YSmart is implemented in Python which makes the codes much easier to understand. Due to its modularity and script nature, users can easily modify the current functionalities or add new functionalities to YSmart.

  • High Flexibility: YSmart can run in two different modes: translation-mode and execution-mode. In the translation-mode, YSmart only translates the query into Java codes while in the execution-mode YSmart will also compile and execute the generated codes. Because of this flexibility, users can easily read, modify and customize the generated codes.

Answered 2013-01-31T19:15:03.793