
So I have a MapReduce job that takes in multiple news articles and outputs the following key-value pairs:

    ...
    <article_id, social_tag.name, social_tag.isCompany, social_tag.code>
    <article_id2, social_tag2.name, social_tag2.isCompany, social_tag2.code>
    <article_id, topic_code.name, topic_code.isCompany, topic_code.rcsCode>
    <article_id3, social_tag3.name, social_tag3.isCompany, social_tag3.code>
    <article_id2, topic_code2.name, topic_code2.isCompany, topic_code2.rcsCode>
    ...

As you can see, there are two different types of data rows that I am currently outputting, and right now they get mixed up in the flat files produced by MapReduce. Is there any way I can simply output social_tags to file1 and topic_codes to file2, OR output social_tags to one group of files (social1.txt, social2.txt, etc.) and topic_codes to another group (topic1.txt, topic2.txt, etc.)?

The reason I'm asking is so that I can easily load all of this into Hive tables later on. I would prefer to have a separate table for each data type (topic_code, social_tag, etc.). If any of you know a simple way to achieve this without separating the MapReduce output into different files, that would be really helpful too.

Thanks in advance!


2 Answers


You can use MultipleOutputs as already suggested. Since you asked for a simple way to achieve this without separating the MapReduce output into different files, here is a quick approach that works if the amount of data is not huge and the logic to differentiate the records is not too complex.

First, load the mixed output file into a Hive table (say, main_table). Then create two separate tables (topic_code, social_tag) and insert the data from the main table after filtering it with a WHERE clause.

    hive > insert into table topic_code
         > select * from main_table
         > where $condition;

    // $condition = the logic you would use to differentiate the records in the MR job
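As a rough end-to-end sketch of that flow (the column names, delimiter, file path, and WHERE conditions below are made up for illustration; the real schema and split logic depend on what your MR job actually emits):

    -- staging table for the mixed MapReduce output
    -- (example columns only; adjust to your real record layout)
    hive > create table main_table (
         >     article_id string,
         >     tag_name   string,
         >     is_company boolean,
         >     tag_code   string
         > ) row format delimited fields terminated by ',';

    hive > load data inpath '/path/to/mr/output' into table main_table;

    hive > create table social_tag like main_table;
    hive > create table topic_code like main_table;

    -- split the rows into the two tables; the WHERE clauses below stand in
    -- for whatever logic distinguishes social tags from topic codes
    hive > insert into table social_tag
         > select * from main_table
         > where tag_code not like 'rcs%';

    hive > insert into table topic_code
         > select * from main_table
         > where tag_code like 'rcs%';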
Answered 2013-07-19T14:22:27.953

I think you can try MultipleOutputs, present in the Hadoop API. MultipleOutputs allows you to write data to files whose names are derived from the output keys and values, or in fact from an arbitrary string. This allows each reducer (or mapper in a map-only job) to create more than a single file. File names are of the form name-m-nnnnn for map outputs and name-r-nnnnn for reduce outputs, where name is an arbitrary name set by the program and nnnnn is an integer designating the part number, starting from zero.
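In the driver you register one named output per record type. A minimal sketch (the names "social" and "topic", the class names, and the output types are just an illustration, not your actual job):

    // Driver-side sketch (new mapreduce API): register one named output per record type.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class SplitTagsJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "split social_tags and topic_codes");
            job.setJarByClass(SplitTagsJob.class);
            // set your mapper/reducer classes and key/value types as in your existing job
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // files will come out as social-r-nnnnn and topic-r-nnnnn in the output directory
            MultipleOutputs.addNamedOutput(job, "social", TextOutputFormat.class, Text.class, NullWritable.class);
            MultipleOutputs.addNamedOutput(job, "topic",  TextOutputFormat.class, Text.class, NullWritable.class);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }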

In the reducer, where we generate the output, we construct an instance of MultipleOutputs in the setup() method and assign it to an instance variable. We then use the MultipleOutputs instance in the reduce() method to write to the output, in place of the context. The write() method takes the key and value, as well as a name.
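A minimal reducer along those lines might look like this. The record layout and the string check used to tell the two row types apart are assumptions; adapt them to your actual key/value classes and split logic:

    // Reducer-side sketch: route each record to the "social" or "topic" named output
    // registered in the driver, instead of writing through the context.
    import java.io.IOException;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    public class SplitTagsReducer extends Reducer<Text, Text, Text, NullWritable> {
        private MultipleOutputs<Text, NullWritable> mos;

        @Override
        protected void setup(Context context) {
            mos = new MultipleOutputs<Text, NullWritable>(context);
        }

        @Override
        protected void reduce(Text articleId, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text value : values) {
                String out = articleId + "," + value;
                // assumption: topic_code rows carry an "rcsCode" marker; replace this
                // with whatever actually distinguishes the two row types in your data
                if (value.toString().contains("rcsCode")) {
                    mos.write("topic", new Text(out), NullWritable.get());
                } else {
                    mos.write("social", new Text(out), NullWritable.get());
                }
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            mos.close();   // flush and close all named outputs
        }
    }

With this in place, the job's output directory will contain social-r-nnnnn and topic-r-nnnnn files, which you can then load into (or point external Hive tables at) separately.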

You can look at the link below for details:

http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html

Answered 2013-07-19T11:49:29.460