hadoop - Reading files using the distributed cache

Question

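I have many files stored in the distributed cache, each one corresponding to a user ID. I want to attach the file for a particular user ID (which will be the reducer's key) to the corresponding reduce task. I can't do this because I read the files from the distributed cache in the configure method, which runs before the reduce method of the reduce class. Since I can't access the reduce method's key inside configure, I can't read only the file I want. Please help.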

class reduce{

void configure(args)
{

/*I can get a particular file from the Path[] here.
I want to select the file corresponding to the key of the reduce method and pass its
contents to the reduce method. I am not able to do this as I can't access the key of
the reduce method.*/

}

void reduce(args)
{
}


}

score 1 · Accepted Answer

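One solution is to assign the Path arrays from the DistributedCache to class variables in the configure step, as described in the DistributedCache javadocs. The reduce method can then pick the file that matches its key, because the arrays are still available when reduce runs. Replace the map code with your reduce code, of course.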

This uses the old API, which your code appears to be using.

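 // Old (org.apache.hadoop.mapred) API; these imports go at the top of the enclosing file.
 import java.io.IOException;
 import org.apache.hadoop.filecache.DistributedCache;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.mapred.JobConf;
 import org.apache.hadoop.mapred.MapReduceBase;
 import org.apache.hadoop.mapred.Mapper;
 import org.apache.hadoop.mapred.OutputCollector;
 import org.apache.hadoop.mapred.Reporter;

 public static class MapClass<K, V> extends MapReduceBase
 implements Mapper<K, V, K, V> {

   // Kept as fields so that map() can use them after configure() has run
   private Path[] localArchives;
   private Path[] localFiles;

   @Override
   public void configure(JobConf job) {
     try {
       // Get the cached archives/files
       localArchives = DistributedCache.getLocalCacheArchives(job);
       localFiles = DistributedCache.getLocalCacheFiles(job);
     } catch (IOException e) {
       // configure() cannot throw checked exceptions
       throw new RuntimeException(e);
     }
   }

   @Override
   public void map(K key, V value,
                   OutputCollector<K, V> output, Reporter reporter)
   throws IOException {
     // Use data from the cached archives/files here
     // ...
     // ...
     output.collect(key, value);
   }
 }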
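
To pick the one file that belongs to the current key, the reduce method needs some way to map a user ID to a path. Here is a minimal sketch of that lookup in plain Java, with no Hadoop dependency. The class name CacheFileSelector and its methods are hypothetical, and it assumes each cached file is named after its user ID (for example u1.txt), which the question does not state.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: maps a reduce key (user ID) to its cached local file. */
final class CacheFileSelector {
    // user ID -> local path of the cached file
    private final Map<String, String> pathByUserId;

    CacheFileSelector(List<String> localPaths) {
        this.pathByUserId = indexByUserId(localPaths);
    }

    /** Builds the index. Assumes each file is named "<userId>" or "<userId>.<ext>". */
    static Map<String, String> indexByUserId(List<String> localPaths) {
        Map<String, String> index = new HashMap<>();
        for (String p : localPaths) {
            String name = Paths.get(p).getFileName().toString();
            int dot = name.lastIndexOf('.');
            // Strip the extension, if any, to recover the user ID
            String userId = dot > 0 ? name.substring(0, dot) : name;
            index.put(userId, p);
        }
        return index;
    }

    /** Reads the lines of the file for this user ID, or returns null if none was cached. */
    List<String> linesFor(String userId) {
        String p = pathByUserId.get(userId);
        if (p == null) {
            return null;
        }
        try {
            return Files.readAllLines(Paths.get(p), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In a reducer you would build the selector once in configure(), passing the string form of each entry in localFiles (Path.toString()), and then call linesFor(key.toString()) inside reduce(). Only the file for the current key is read, and only when that key actually arrives.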
