hadoop - Distributed cache in Hadoop not retrieving file contents

Question

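I am getting garbage-like values instead of the data from the file I want to use as a distributed cache.

The job configuration is as follows: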

Configuration config5 = new Configuration();
JobConf conf5 = new JobConf(config5, Job5.class);
conf5.setJobName("Job5");
conf5.setOutputKeyClass(Text.class);
conf5.setOutputValueClass(Text.class);
conf5.setMapperClass(MapThree4c.class);
conf5.setReducerClass(ReduceThree5.class);
conf5.setInputFormat(TextInputFormat.class);
conf5.setOutputFormat(TextOutputFormat.class);


DistributedCache.addCacheFile(new URI("/home/users/mlakshm/ap1228"), conf5);
FileInputFormat.setInputPaths(conf5, new Path(other_args.get(5)));
FileOutputFormat.setOutputPath(conf5, new Path(other_args.get(6)));

JobClient.runJob(conf5);

In the mapper, I have the following code:

public class MapThree4c extends MapReduceBase implements Mapper<LongWritable, Text, 
Text, Text >{
private Set<String> prefixCandidates = new HashSet<String>();

Text a = new Text();
public void configure(JobConf conf5) {

Path[] dates = new Path[0];
try {
        dates = DistributedCache.getLocalCacheFiles(conf5);
        System.out.println("candidates: "+candidates);
        String astr = dates.toString();
        a = new Text(astr);

      } catch (IOException ioe) {
        System.err.println("Caught exception while getting cached files: " +   
      StringUtils.stringifyException(ioe));
      }


  }




   public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, 
   Reporter reporter) throws IOException {

     String line = value.toString();
     StringTokenizer st = new StringTokenizer(line);
     st.nextToken();
     String t = st.nextToken();
     String uidi = st.nextToken();
     String uidj = st.nextToken();

     String check = null;

     output.collect(new Text(line), a);



        }


    }

The output value I get from this mapper is: [Lorg.apache.hadoop.fs.Path;@786c1a82
instead of the values from the distributed cache file.

score 1 · Accepted Answer

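This looks like what you get when you call toString() on an array, and if you look at the javadocs for DistributedCache.getLocalCacheFiles(), an array of Path objects is exactly what it returns. If you need to read the actual contents of the files in the cache, you can open and read them with the standard Java APIs.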
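As a minimal sketch of "open and read them with the standard Java APIs", the helper below loads the non-empty lines of one or more local files into a Set. The class name CacheFileReader and the method readLines are hypothetical and not part of Hadoop. It assumes each entry is a plain local file system path, for example the result of dates[i].toString() in configure(); if the Path carries a URI scheme, dates[i].toUri().getPath() may be needed instead. It also assumes the cached file is UTF-8 text.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper (not part of Hadoop): reads locally cached files line by line.
public class CacheFileReader {

    // localPaths: local file system paths, assumed to come from
    // DistributedCache.getLocalCacheFiles(conf), converted with Path.toString().
    public static Set<String> readLines(List<String> localPaths) throws IOException {
        Set<String> lines = new HashSet<String>();
        for (String localPath : localPaths) {
            // try-with-resources closes the reader even if reading fails
            try (BufferedReader reader =
                     Files.newBufferedReader(Paths.get(localPath), StandardCharsets.UTF_8)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String trimmed = line.trim();
                    if (!trimmed.isEmpty()) {
                        lines.add(trimmed); // keep each distinct, non-empty line once
                    }
                }
            }
        }
        return lines;
    }
}
```

In the mapper from the question, the returned set could be stored in the existing prefixCandidates field inside configure() and then consulted in map(), instead of emitting the string form of the array.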

score 0 · Accepted Answer

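From your code:

Path[] dates = DistributedCache.getLocalCacheFiles(conf5);

it follows that:

String astr = dates.toString(); // refers to the array object itself (i.e. dates), not its elements, which is why you see [Lorg.apache.hadoop.fs.Path;@786c1a82 in the output.

To see the actual paths, you need to iterate over the array: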

for(Path cacheFile: dates){

    output.collect(new Text(line), new Text(cacheFile.getName()));

}
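To illustrate why the array printed as [Lorg.apache.hadoop.fs.Path;@786c1a82, here is a small standalone sketch using a String array in place of Path[]. The class name ArrayToStringDemo and the file name other.txt are made up for the example. Calling toString() directly on an array yields a type tag followed by a hash code, while Arrays.toString() formats the elements.

```java
import java.util.Arrays;

// Hypothetical demo class: compares the two ways of turning an array into a String.
public class ArrayToStringDemo {

    // Default Object.toString() on an array: "[L<element type>;@<hash>"
    public static String defaultForm(Object[] items) {
        return items.toString();
    }

    // Arrays.toString() formats each element, separated by ", " inside brackets
    public static String readableForm(Object[] items) {
        return Arrays.toString(items);
    }

    public static void main(String[] args) {
        String[] files = {"ap1228", "other.txt"}; // "other.txt" is an illustrative name
        System.out.println(defaultForm(files));  // starts with [Ljava.lang.String;@ (hash varies per run)
        System.out.println(readableForm(files)); // prints [ap1228, other.txt]
    }
}
```

Either form only shows the paths. Reading the data inside the cached file still requires opening it, as the first answer notes.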
