
I'm a beginner with Pig and Hadoop. I'm trying to understand what's going on behind the scenes in this simple Pig script. I'm reading in some data, splitting it into three new relations, and storing each in a different directory. The script runs on my pseudo-distributed Hadoop installation as one map-only job.

I have been trying to figure out how I could implement this in plain Java MapReduce as a single map-only job. The filtering/splitting would be trivial, but I don't know how I'd get a map-only job to send different key/value pairs to different outputs. Come to think of it, I don't know how I'd send output to multiple places even in a full MapReduce job.

rawTweets = LOAD 'geotaggedTweets' USING PigStorage(',') AS (...);

SPLIT rawTweets INTO usTweets IF country == 'US', gbTweets IF country == 'GB', idTweets IF country == 'ID';

STORE usTweets INTO 'testUSTweets' USING PigStorage(',');
STORE gbTweets INTO 'testGBTweets' USING PigStorage(',');
STORE idTweets INTO 'testIDTweets' USING PigStorage(',');

Edit: Ugh... I've done it again. I never seem to come up with the answers to my own questions until I've gone through the whole process of writing and submitting an SO question. The Hadoop class I'm looking for is MultipleOutputs.
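
For anyone who finds this later, here is roughly the kind of map-only job I had in mind using MultipleOutputs. It's only a sketch: the class name, the named outputs, and the assumption that the country code is the third comma-separated field are mine, not from the real tweet data, so adjust those to your schema.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class SplitTweetsByCountry {

    public static class SplitMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {

        private MultipleOutputs<NullWritable, Text> mos;

        @Override
        protected void setup(Context context) {
            mos = new MultipleOutputs<>(context);
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assumes the country code is the third comma-separated field;
            // change the index to match the real geotaggedTweets schema.
            String[] fields = value.toString().split(",");
            if (fields.length < 3) {
                return;
            }
            String country = fields[2];

            if ("US".equals(country)) {
                // The fourth argument is a base output path, so each country's
                // records land in their own subdirectory of the job output dir.
                mos.write("US", NullWritable.get(), value, "US/part");
            } else if ("GB".equals(country)) {
                mos.write("GB", NullWritable.get(), value, "GB/part");
            } else if ("ID".equals(country)) {
                mos.write("ID", NullWritable.get(), value, "ID/part");
            }
            // Records from other countries are dropped, like the Pig SPLIT.
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            mos.close(); // flushes all named outputs
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split tweets by country");
        job.setJarByClass(SplitTweetsByCountry.class);
        job.setMapperClass(SplitMapper.class);
        job.setNumReduceTasks(0); // map-only job

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Suppress the empty default part-m-* files; only named outputs are written.
        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

        // Register one named output per country.
        MultipleOutputs.addNamedOutput(job, "US", TextOutputFormat.class,
                NullWritable.class, Text.class);
        MultipleOutputs.addNamedOutput(job, "GB", TextOutputFormat.class,
                NullWritable.class, Text.class);
        MultipleOutputs.addNamedOutput(job, "ID", TextOutputFormat.class,
                NullWritable.class, Text.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

One difference from the Pig script: the three outputs end up as subdirectories (US/, GB/, ID/) under the single job output path rather than three separate top-level directories, since a MapReduce job has one output directory.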
