I want to create a chain of three Hadoop jobs, where the output of one job is fed as the input to the second job and so on. I would like to do this without using Oozie.
I have written the following code to acheive it :-
public class TfIdf {
public static void main(String args[]) throws IOException, InterruptedException, ClassNotFoundException
{
TfIdf tfIdf = new TfIdf();
tfIdf.runWordCount();
tfIdf.runDocWordCount();
tfIdf.TFIDFComputation();
}
public void runWordCount() throws IOException, InterruptedException, ClassNotFoundException
{
Job job = new Job();
job.setJarByClass(TfIdf.class);
job.setJobName("Word Count calculation");
job.setMapperClass(WordFrequencyMapper.class);
job.setReducerClass(WordFrequencyReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.setInputPaths(job, new Path("input"));
FileOutputFormat.setOutputPath(job, new Path("ouput"));
job.waitForCompletion(true);
}
public void runDocWordCount() throws IOException, InterruptedException, ClassNotFoundException
{
Job job = new Job();
job.setJarByClass(TfIdf.class);
job.setJobName("Word Doc count calculation");
job.setMapperClass(WordCountDocMapper.class);
job.setReducerClass(WordCountDocReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.setInputPaths(job, new Path("output"));
FileOutputFormat.setOutputPath(job, new Path("ouput_job2"));
job.waitForCompletion(true);
}
public void TFIDFComputation() throws IOException, InterruptedException, ClassNotFoundException
{
Job job = new Job();
job.setJarByClass(TfIdf.class);
job.setJobName("TFIDF calculation");
job.setMapperClass(TFIDFMapper.class);
job.setReducerClass(TFIDFReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.setInputPaths(job, new Path("output_job2"));
FileOutputFormat.setOutputPath(job, new Path("ouput_job3"));
job.waitForCompletion(true);
}
}
However I get the error:
Input path does not exist: hdfs://localhost.localdomain:8020/user/cloudera/output
Could anyone help me out with this?