java - Filtering input files with globStatus in MapReduce

Question

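I have a lot of input files and I want to process only selected ones, based on the date appended to the end of each file name. I am now confused about where to use the globStatus method to filter the files.

I have a custom RecordReader class, and I tried to use globStatus in its next method, but it didn't work.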

public boolean next(Text key, Text value) throws IOException {
    Path filePath = fileSplit.getPath();

    if (!processed) {
        key.set(filePath.getName());

        byte[] contents = new byte[(int) fileSplit.getLength()];
        value.clear();
        FileSystem fs = filePath.getFileSystem(conf);
        fs.globStatus(new Path("/*" + date));
        FSDataInputStream in = null;

        try {
            in = fs.open(filePath);
            IOUtils.readFully(in, contents, 0, contents.length);
            value.set(contents, 0, contents.length);
        } finally {
            IOUtils.closeStream(in);
        }
        processed = true;
        return true;
    }
    return false;
}

I know it returns an array of FileStatus, but how do I use it to filter the files? Can someone shed some light on this?

score 10 · Accepted Answer

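The globStatus method takes 2 complementary arguments that let you filter your files. The first is the glob pattern, but sometimes a glob pattern is not powerful enough to select specific files, in which case you can define a PathFilter.

Regarding glob patterns, the following are supported: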

Glob   | Matches
-------------------------------------------------------------------------------------------------------------------
*      | Matches zero or more characters
?      | Matches a single character
[ab]   | Matches a single character in the set {a, b}
[^ab]  | Matches a single character not in the set {a, b}
[a-b]  | Matches a single character in the range [a, b] where a is lexicographically less than or equal to b
[^a-b] | Matches a single character not in the range [a, b] where a is lexicographically less than or equal to b
{a,b}  | Matches either expression a or b
\c     | Matches character c when it is a metacharacter
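
To make the table concrete, here is a minimal sketch that translates the metacharacters above into a java.util.regex pattern and tests single file names against it. It is an illustration only, not Hadoop's own matcher: GlobSketch and its methods are hypothetical names, the sample file names are made up, and the sketch assumes a well-formed glob applied to one path component.

```java
import java.util.regex.Pattern;

public class GlobSketch {

    /** Translates a glob built from the table's metacharacters into a regex. */
    static String toRegex(String glob) {
        StringBuilder regex = new StringBuilder();
        boolean inClass = false; // currently inside [...]
        int braceDepth = 0;      // currently inside {...}
        for (int i = 0; i < glob.length(); i++) {
            char c = glob.charAt(i);
            if (inClass) {
                // Sets, ranges and ^ negation use the same syntax in a regex.
                regex.append(c);
                if (c == ']') {
                    inClass = false;
                }
            } else if (c == '\\' && i + 1 < glob.length()) {
                // \c matches the metacharacter c literally.
                regex.append(Pattern.quote(String.valueOf(glob.charAt(++i))));
            } else if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else if (c == '[') {
                regex.append('[');
                inClass = true;
            } else if (c == '{') {
                regex.append("(?:");
                braceDepth++;
            } else if (c == '}' && braceDepth > 0) {
                regex.append(')');
                braceDepth--;
            } else if (c == ',' && braceDepth > 0) {
                regex.append('|');
            } else {
                // Every other character stands for itself.
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return regex.toString();
    }

    /** Returns true if the whole name matches the glob. */
    static boolean matches(String glob, String name) {
        return Pattern.matches(toRegex(glob), name);
    }

    public static void main(String[] args) {
        String date = "2013-01-15";
        System.out.println(matches("*" + date, "events-2013-01-15")); // true
        System.out.println(matches("*" + date, "events-2013-01-16")); // false
        System.out.println(matches("log-201[0-3]-??-{01,15}", "log-2013-01-15")); // true
        System.out.println(matches("\\*", "*")); // true: escaped metacharacter
    }
}
```

For the question's case, a pattern such as "*" + date is the kind of glob that would be passed to globStatus to select files whose names end with the date.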

PathFilter is simply an interface like this:

public interface PathFilter {
    boolean accept(Path path);
}

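So you can implement this interface and put your file-filtering logic in the accept method.

An example from Tom White's excellent book, which lets you define a PathFilter that filters out paths matching a given regular expression: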

public class RegexExcludePathFilter implements PathFilter {
    private final String regex;

    public RegexExcludePathFilter(String regex) {
        this.regex = regex;
    }

    public boolean accept(Path path) {
        return !path.toString().matches(regex);
    }
}
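
One detail worth noting in accept: String.matches succeeds only when the regex matches the entire string, so the pattern has to cover the whole path, not just the part of interest. The sketch below applies the same test to plain strings so that it runs without Hadoop on the classpath. RegexExcludeSketch, its methods and the sample paths are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RegexExcludeSketch {

    /** Same test as RegexExcludePathFilter.accept, applied to a path string. */
    static boolean accept(String path, String regex) {
        return !path.matches(regex);
    }

    /** Returns the paths that the exclude filter lets through. */
    static List<String> keep(List<String> paths, String regex) {
        List<String> kept = new ArrayList<>();
        for (String path : paths) {
            if (accept(path, regex)) {
                kept.add(path);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList("/logs/2007/12/30", "/logs/2007/12/31");

        // The regex covers the whole path, so the second entry is excluded.
        System.out.println(keep(paths, "^.*/2007/12/31$")); // [/logs/2007/12/30]

        // A partial regex never matches the whole string, so nothing is excluded.
        System.out.println(keep(paths, "2007/12/31")); // [/logs/2007/12/30, /logs/2007/12/31]
    }
}
```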

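You can filter your input directly with a PathFilter implementation by calling FileInputFormat.setInputPathFilter(JobConf, RegexExcludePathFilter.class) when initializing your job.

EDIT: Since you have to pass the class to setInputPathFilter, you can't pass arguments directly, but you should be able to do something similar through the Configuration. If your RegexExcludePathFilter also extends Configured, it gets back the Configuration object that you initialized earlier with the desired values, so you can read those values inside the filter and use them in accept.

For example, if you initialize like this: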

conf.set("date", "2013-01-15");

Then you can define your filter like this:

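import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class RegexIncludePathFilter extends Configured implements PathFilter {
    private String date;
    private FileSystem fs;

    @Override
    public boolean accept(Path path) {
        try {
            // Always accept directories so that their contents get filtered too.
            // fs stays null if FileSystem.get failed in setConf.
            if (fs != null && fs.isDirectory(path)) {
                return true;
            }
        } catch (IOException e) {
            // Could not check the path; fall through to the name check.
        }
        return path.toString().endsWith(date);
    }

    @Override
    public void setConf(Configuration conf) {
        // Configured's constructor calls setConf(null), hence the null check.
        if (null != conf) {
            this.date = conf.get("date");
            try {
                this.fs = FileSystem.get(conf);
            } catch (IOException e) {
                // Leave fs unset; accept then relies on the name check only.
            }
        }
    }
}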

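EDIT 2: The original code had a few issues, please see the updated class. You also need to remove the constructor, since it is no longer used, and check whether the path is a directory, in which case you should return true so that the contents of the directory can be filtered too.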

score 3 · Answer

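For anyone reading this, can I say "please don't do anything more complex in the filters than validating the paths". Specifically: do not check whether the files are directories, get their sizes, etc. Wait until the list/glob operation has returned, and then do the filtering there, using the information in the now-populated FileStatus entries.

Why? All those calls to getFileStatus(), directly or via isDirectory(), are extra calls to the filesystem. More critically, against S3 and other object stores, each operation may issue multiple HTTPS requests, and those really do take a noticeable amount of time. Worse still, S3 will throttle you if it thinks you are making too many requests across your whole cluster of machines. You do not want that.

Wait until after the call: the file status entries you get back are the ones from the object store's list commands, which usually return thousands of file entries per HTTPS request, so this is far more efficient.

For more details, look at org.apache.hadoop.fs.s3a.S3AFileSystem.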
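
The advice above can be sketched as a plain loop over the entries returned by the listing. To keep the sketch runnable without Hadoop on the classpath, Entry is a hypothetical stand-in for the few FileStatus fields used here. With Hadoop, the array would come from fs.globStatus(...) and the checks would use FileStatus.isDirectory(), getLen() and getPath(). The size check and the sample paths are assumptions added for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class PostGlobFilterSketch {

    /** Hypothetical stand-in for the FileStatus fields used by the filter. */
    static final class Entry {
        final String path;
        final boolean directory;
        final long length;

        Entry(String path, boolean directory, long length) {
            this.path = path;
            this.directory = directory;
            this.length = length;
        }
    }

    /**
     * Keeps non-empty files whose path ends with the date. Only data already
     * present in the listing is used, so no further filesystem calls are made.
     */
    static List<String> select(Entry[] listed, String date) {
        List<String> selected = new ArrayList<>();
        if (listed == null) {
            return selected; // defensive: treat a missing result as no matches
        }
        for (Entry entry : listed) {
            if (!entry.directory && entry.length > 0 && entry.path.endsWith(date)) {
                selected.add(entry.path);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        Entry[] listed = {
            new Entry("/input/a-2013-01-15", false, 1024),
            new Entry("/input/b-2013-01-16", false, 2048),
            new Entry("/input/archive-2013-01-15", true, 0),
            new Entry("/input/empty-2013-01-15", false, 0),
        };
        System.out.println(select(listed, "2013-01-15")); // [/input/a-2013-01-15]
    }
}
```

The selected paths could then be added to the job as input paths in the driver, which keeps all per-file decisions out of the PathFilter.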
