java - Best way to distribute a small lookup file using the distributed cache

Question

What is the best way to read data from the distributed cache?

public class TrailMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Lookup data, filled once per map task in setup().
    private final ArrayList<String> globalFreq = new ArrayList<String>();

    @Override
    public void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        FileSystem fs = FileSystem.get(conf);
        URI[] cacheFiles = DistributedCache.getCacheFiles(conf);
        Path getPath = new Path(cacheFiles[0].getPath());
        // try-with-resources closes the reader even if parsing fails.
        try (BufferedReader bf = new BufferedReader(new InputStreamReader(fs.open(getPath)))) {
            String setupData = null;
            while ((setupData = bf.readLine()) != null) {
                String[] parts = setupData.split(" ");
                globalFreq.add(parts[0]);
            }
        }
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Access the "globalFreq" data here and do further processing.
    }
}

Or:

public class TrailMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Kept as fields so that map() can use them.
    private URI[] cacheFiles;
    private FileSystem fs;

    @Override
    public void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        fs = FileSystem.get(conf);
        cacheFiles = DistributedCache.getCacheFiles(conf);
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The cache file is opened and parsed again on every call to map().
        ArrayList<String> globalFreq = new ArrayList<String>();
        Path getPath = new Path(cacheFiles[0].getPath());
        try (BufferedReader bf = new BufferedReader(new InputStreamReader(fs.open(getPath)))) {
            String setupData = null;
            while ((setupData = bf.readLine()) != null) {
                String[] parts = setupData.split(" ");
                globalFreq.add(parts[0]);
            }
        }
    }
}

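So if we do it this way (code 2), does that mean that, say we have 5 map tasks, every map task reads the same copy of the data? Written like this, with the read inside map, the data is read multiple times (5 times), am I right?

Code 1: because the read is written in setup, the file is read once and the global data is then accessed in map.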

Is that the right way to use the distributed cache?

score 0 · Accepted Answer

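Do as much as possible in the setup method: it is called once per mapper, and whatever it loads then stays cached for every record passed to that mapper. Parsing the data for every record is overhead you can avoid, because nothing in the parsing depends on the key, value, and context variables you receive in the map method.

The setup method is called once per map task, whereas map is called for every record that task processes (which can obviously be a very high number).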
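To make the difference concrete, here is a minimal sketch in plain Java that counts how often the lookup data gets parsed under each style. It does not use Hadoop at all: the class `SetupVsMapDemo`, the in-memory `LOOKUP` string, and the helper names are all hypothetical stand-ins for the cache file and the mapper lifecycle, so treat it as an illustration of the call pattern rather than real mapper code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Plain-JDK stand-in for one map task; no Hadoop classes are involved. */
class SetupVsMapDemo {

    /** Hypothetical lookup content standing in for the cached file. */
    static final String LOOKUP = "apple 3\nbanana 5\ncherry 7\n";

    /** Number of times the lookup data has been parsed. */
    static int parseCount = 0;

    /** Parses the lookup data, keeping the first token of each line. */
    static List<String> loadLookup() {
        parseCount++;
        List<String> keys = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(LOOKUP))) {
            String line;
            while ((line = reader.readLine()) != null) {
                keys.add(line.split(" ")[0]);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return keys;
    }

    /** Code 1 style: parse once (as in setup), then only look up per record. */
    static int runWithSetup(int records) {
        parseCount = 0;
        List<String> globalFreq = loadLookup();      // setup(): once per task
        for (int i = 0; i < records; i++) {
            globalFreq.contains("banana");           // map(): lookup only
        }
        return parseCount;
    }

    /** Code 2 style: parse again inside every simulated map() call. */
    static int runWithoutSetup(int records) {
        parseCount = 0;
        for (int i = 0; i < records; i++) {
            List<String> globalFreq = loadLookup();  // map(): parse per record
            globalFreq.contains("banana");
        }
        return parseCount;
    }

    public static void main(String[] args) {
        System.out.println("setup style parses: " + runWithSetup(1000));
        System.out.println("map style parses:   " + runWithoutSetup(1000));
    }
}
```

For one simulated task with 1000 records, the setup style parses the data once and the map style parses it 1000 times. So with code 2 the cost scales with the number of records, not with the number of map tasks.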
