This may be a rather clumsy question. I have two documents, and I want to find the overlap between them in a map-reduce fashion and then compare the overlap (assume I have some measure for doing that).
So here is what I am thinking:
1) Run the normal wordcount job on one document (https://sites.google.com/site/hadoopandhive/home/hadoop-how-to-count-number-of-times-a-word-appeared-in-a-file-using-map-reduce-framework)
2) But rather than writing the output to a file, save everything in a HashMap<String, Boolean> (word → true)
3) Pass that HashMap along to the second wordcount MapReduce program, and as I process the second document, check each word against the HashMap to see whether it is present or not.
So, something like this:
1) HashMap<String, Boolean> hm = runStepOne(); <-- map reduce job
2) runStepTwo(hm); <-- second map reduce job
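To make the intended flow concrete, here is a minimal local sketch of the two passes in plain Java, outside Hadoop (all names here are mine, not from an actual job): pass 1 stands in for the first wordcount job's output, collapsed into a presence map, and pass 2 stands in for the second mapper checking each word against that map. In a real Hadoop setup the map cannot be handed between jobs in memory; the usual approach is to write job 1's output to HDFS and load it in job 2's mapper setup, e.g. via the distributed cache.

```java
import java.util.*;

// Local sketch of the two-pass overlap idea (hypothetical names).
public class OverlapSketch {

    // Pass 1: analogue of the first wordcount job, reduced to a
    // word -> true presence map instead of a file of counts.
    static Map<String, Boolean> buildPresenceMap(String docA) {
        Map<String, Boolean> present = new HashMap<>();
        for (String w : docA.toLowerCase().split("\\s+")) {
            present.put(w, Boolean.TRUE);
        }
        return present;
    }

    // Pass 2: analogue of the second job's mapper, which checks each
    // word of document B against the map built from document A.
    static Set<String> overlap(Map<String, Boolean> present, String docB) {
        Set<String> common = new TreeSet<>();
        for (String w : docB.toLowerCase().split("\\s+")) {
            if (present.containsKey(w)) {
                common.add(w);
            }
        }
        return common;
    }

    public static void main(String[] args) {
        Map<String, Boolean> present =
                buildPresenceMap("hadoop map reduce overlap");
        // Only words appearing in both documents survive.
        System.out.println(overlap(present, "spark and hadoop both reduce data"));
    }
}
```

This only illustrates the logic; the actual question of wiring the two jobs together in Hadoop remains.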
How can I do this in Hadoop?