hadoop - OutOfMemoryError when reading a local file through DistributedCache

Question

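Update, November 21, 2012:

I solved the problem by setting the property mapred.child.java.opts to -Xmx512m. Before that, I had set HADOOP_HEAPSIZE to 2000 in core-site.xml, but that did not help. I still don't understand why the program works locally but not in distributed mode. Thanks for all the answers.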

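I am using Hadoop 1.0.3. The cluster consists of three machines, all running Ubuntu Linux 12.04 LTS. Two of the machines have 12 GB of RAM and the third has 4 GB. I am reading a local file of about 40 MB through DistributedCache. My program runs perfectly in a local environment (local/standalone mode). However, when I run it on the Hadoop cluster (fully distributed mode) with the same 40 MB file, I get "OutOfMemoryError: Java heap space". I don't understand why this happens, since the file is not that large. Here is the code: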

    public static class MapClass extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
    // ...
    private HashMap<String, String> urlTrad = new HashMap<String, String>();
    // ...
    @Override
    public void configure(JobConf job) {
        Path[] urlsFiles = new Path[0];
        BufferedReader fis;

        try {
            urlsFiles = DistributedCache.getLocalCacheFiles(job);
            fis = new BufferedReader(new FileReader(
                    urlsFiles[0].toString()));
            String pattern;
            while ((pattern = fis.readLine()) != null) {
                String[] parts = pattern.split("\t");
                urlTrad.put(parts[0], parts[1]);
            }
            fis.close();

        } catch (IOException ioe) {
            System.err
                    .println("Caught exception while parsing the cached file '"
                    + urlsFiles[0]
                    + "' : "
                    + StringUtils.stringifyException(ioe));
        }
    }
    // ...

Any help would be greatly appreciated. Thanks in advance.
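A rough way to see why a 40 MB file can need far more than 40 MB of heap once it is loaded into a HashMap<String, String>: on the Java versions of that era, String contents are stored as 2-byte chars, and every key, value, and map entry adds its own object overhead. The sketch below is only a back-of-the-envelope estimate. The class name, the two overhead constants, and the line count used in the usage note are all assumptions for illustration, not measurements from the question.

```java
final class MapFootprintEstimate {
    // ASSUMED approximate per-object costs on a 64-bit HotSpot JVM.
    // Real values depend on the JVM version and flags; treat these as placeholders.
    static final long ASSUMED_STRING_OVERHEAD_BYTES = 56; // String object plus its char[] header
    static final long ASSUMED_ENTRY_OVERHEAD_BYTES = 48;  // HashMap entry plus its slot in the table

    /**
     * Estimates the heap needed to hold a tab-separated file as a
     * HashMap<String, String> with one entry (key + value) per line.
     */
    static long estimateBytes(long fileBytes, long lineCount) {
        // 2 bytes per char, assuming roughly one char per byte of the file.
        long charData = fileBytes * 2;
        // Each line produces two Strings (key and value) and one map entry.
        long perLine = 2 * ASSUMED_STRING_OVERHEAD_BYTES + ASSUMED_ENTRY_OVERHEAD_BYTES;
        return charData + lineCount * perLine;
    }

    public static void main(String[] args) {
        long fileBytes = 40L * 1024 * 1024; // file size from the question
        long lineCount = 500000L;           // hypothetical; the real line count is not given
        long mb = estimateBytes(fileBytes, lineCount) / (1024L * 1024L);
        System.out.println("Estimated heap for the map: about " + mb + " MB");
    }
}
```

With the hypothetical 500,000 lines, the estimate comes to roughly 156 MB, several times the size of the file on disk. Temporary objects created while parsing would come on top of that.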

score 1 · Accepted Answer

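I solved the problem by setting the property mapred.child.java.opts to -Xmx512m. Before that, I had set HADOOP_HEAPSIZE to 2000 in core-site.xml, but that did not help. I still don't understand why the program works locally but not in distributed mode.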
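A possible explanation for the local versus distributed difference, offered as an assumption rather than something confirmed in this thread: in Hadoop 1.x each map task runs in its own child JVM, and that JVM's heap comes from mapred.child.java.opts (which defaults to -Xmx200m). In local/standalone mode the mapper runs inside the client JVM instead. HADOOP_HEAPSIZE is normally an environment variable set in conf/hadoop-env.sh that sizes the Hadoop daemons, so it would not reach the child tasks. One way to check this is to log the heap limit from inside the task. The HeapCheck class below is a hypothetical helper, not part of Hadoop; it uses only the standard Runtime API.

```java
final class HeapCheck {
    /** Returns the maximum heap this JVM will attempt to use, in megabytes. */
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    /** Builds a one-line message suitable for a task's stderr log. */
    static String describe() {
        return "Max heap available to this JVM: " + maxHeapMb() + " MB";
    }

    public static void main(String[] args) {
        // Inside a mapper, the same call could be placed at the top of
        // configure(JobConf) so the value ends up in the task's stderr log.
        System.err.println(describe());
    }
}
```

Calling HeapCheck.describe() from configure() in both modes should show whether the task JVM on the cluster really has a much smaller heap than the JVM used in local mode.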
