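The globStatus method takes two complementary arguments that let you filter files. The first is a glob pattern; when a glob pattern is not expressive enough to select specific files, you can additionally define a PathFilter.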
Regarding glob patterns, the following are supported:
Glob | Matches
-------------------------------------------------------------------------------------------------------------------
* | Matches zero or more characters
? | Matches a single character
[ab] | Matches a single character in the set {a, b}
[^ab] | Matches a single character not in the set {a, b}
[a-b] | Matches a single character in the range [a, b] where a is lexicographically less than or equal to b
[^a-b] | Matches a single character not in the range [a, b] where a is lexicographically less than or equal to b
{a,b} | Matches either expression a or b
\c | Matches character c when it is a metacharacter
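To make the table concrete, here is a minimal sketch that translates each glob construct into an equivalent java.util.regex pattern and matches it against a single file name. This is not Hadoop's implementation: GlobSketch, toRegex and matches are hypothetical names invented for illustration, the sketch assumes the glob is applied to one path component (so * and ? never cross a /), and it ignores edge cases such as nested braces.

```java
import java.util.regex.Pattern;

/** Illustration only: maps the glob constructs from the table onto regex syntax. */
final class GlobSketch {

    /** Translates a glob (as described in the table) into a regex for one path component. */
    static Pattern toRegex(String glob) {
        StringBuilder re = new StringBuilder();
        boolean inBraces = false;
        for (int i = 0; i < glob.length(); i++) {
            char c = glob.charAt(i);
            switch (c) {
                case '*': re.append("[^/]*"); break;           // zero or more characters
                case '?': re.append("[^/]"); break;            // exactly one character
                case '{': re.append("(?:"); inBraces = true; break;
                case '}': re.append(')'); inBraces = false; break;
                case ',': re.append(inBraces ? "|" : ","); break; // alternation only inside {}
                case '[': {
                    // [ab], [^ab], [a-b] and [^a-b] mean the same thing in regex: copy verbatim.
                    int close = glob.indexOf(']', i + 1);
                    if (close < 0) {
                        throw new IllegalArgumentException("unclosed '[' in " + glob);
                    }
                    re.append(glob, i, close + 1);
                    i = close;
                    break;
                }
                case '\\': {
                    // \c matches the metacharacter c literally.
                    if (i + 1 >= glob.length()) {
                        throw new IllegalArgumentException("dangling '\\' in " + glob);
                    }
                    re.append(Pattern.quote(String.valueOf(glob.charAt(++i))));
                    break;
                }
                default:
                    re.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(re.toString());
    }

    /** Returns true if the whole name matches the glob. */
    static boolean matches(String glob, String name) {
        return toRegex(glob).matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("2013-01-*", "2013-01-15"));
        System.out.println(matches("part-0000?", "part-00001"));
        System.out.println(matches("{jan,feb}.log", "mar.log"));
    }
}
```

In real code you would pass the glob straight to globStatus rather than translating it yourself; the sketch only shows what each row of the table means.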
PathFilter is simply an interface like this:
    public interface PathFilter {
        boolean accept(Path path);
    }
So you can implement this interface and put your file-filtering logic in its accept method.
Here is an example from Tom White's excellent book. It defines a PathFilter that excludes every path matching a given regular expression:
    public class RegexExcludePathFilter implements PathFilter {
        private final String regex;

        public RegexExcludePathFilter(String regex) {
            this.regex = regex;
        }

        @Override
        public boolean accept(Path path) {
            // Reject any path whose full string form matches the regex.
            return !path.toString().matches(regex);
        }
    }
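One detail worth checking when you write the regex: String.matches tests the entire string, not a substring. The sketch below applies the same test as accept to a plain String so it runs without Hadoop; RegexExcludeSketch and the sample path are hypothetical, and the exact string your filter sees depends on what Path.toString() returns in your setup.

```java
/** Illustration only: the accept() test from RegexExcludePathFilter on a plain String. */
final class RegexExcludeSketch {

    /** Returns true when the path should be kept, i.e. it does NOT match the regex. */
    static boolean accept(String path, String regex) {
        return !path.matches(regex);
    }

    public static void main(String[] args) {
        String path = "hdfs://namenode/logs/2013-01-15/part-00000"; // hypothetical path

        // The regex covers only a fragment of the path, so matches() is false
        // and the path is kept rather than excluded.
        System.out.println(accept(path, "2013-01-15"));

        // Wrapping the fragment in .* makes the regex cover the whole string,
        // so the path is excluded.
        System.out.println(accept(path, ".*2013-01-15.*"));
    }
}
```

If a filter seems to exclude nothing, an unanchored fragment like the first regex is a likely cause.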
You can filter your job input directly with a PathFilter implementation by calling FileInputFormat.setInputPathFilter(JobConf, RegexExcludePathFilter.class) when initializing your job.
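Edit: Since setInputPathFilter takes a class, you cannot pass constructor arguments directly, but you should be able to do something similar through the Configuration. If your RegexExcludePathFilter also extends Configured, it receives the Configuration object that you previously initialized with the desired values, so you can read those values inside the filter and use them in accept.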
For example, if you initialize it like this:

    conf.set("date", "2013-01-15");

Then you can define your filter like this:
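    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.PathFilter;

    public class RegexIncludePathFilter extends Configured implements PathFilter {
        private String date;
        private FileSystem fs;

        @Override
        public boolean accept(Path path) {
            try {
                // Always accept directories so that their contents get filtered too.
                if (fs.isDirectory(path)) {
                    return true;
                }
            } catch (IOException e) {
                // Could not stat the path; fall through to the name check.
            }
            return path.toString().endsWith(date);
        }

        @Override
        public void setConf(Configuration conf) {
            super.setConf(conf); // keep getConf() working
            // Configured's no-arg constructor calls setConf(null), hence the null check.
            if (null != conf) {
                this.date = conf.get("date");
                try {
                    this.fs = FileSystem.get(conf);
                } catch (IOException e) {
                    throw new RuntimeException("Cannot obtain FileSystem", e);
                }
            }
        }
    }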
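Edit 2: The original code had a few problems; please see the updated class. You also need to remove the constructor, since it is no longer used, and check whether the path is a directory. If it is, return true so that the directory's contents can be filtered as well.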
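The reason the filter reads its settings in setConf instead of a constructor can be sketched without Hadoop. When a framework is handed only a Class, it typically creates the instance through the no-argument constructor and then injects the configuration. In Hadoop this is, to my knowledge, done by ReflectionUtils.newInstance. Everything in the sketch below is a hypothetical stand-in: a Map replaces Configuration, ConfigurableFilter replaces the Configurable and PathFilter pair, and newInstance only mimics the idea.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustration only: why a filter passed as a Class must get its values from the configuration. */
final class ConfiguredFilterSketch {

    /** Hypothetical stand-in: settings arrive after construction. */
    interface ConfigurableFilter {
        void setConf(Map<String, String> conf);
        boolean accept(String path);
    }

    /** Reads "date" from the configuration instead of a constructor argument. */
    static final class DateSuffixFilter implements ConfigurableFilter {
        private String date;

        public DateSuffixFilter() {
            // No arguments: the framework could not supply any.
        }

        @Override
        public void setConf(Map<String, String> conf) {
            if (conf != null) {
                this.date = conf.get("date");
            }
        }

        @Override
        public boolean accept(String path) {
            return date != null && path.endsWith(date);
        }
    }

    /** Mimics a framework that only knows the filter's Class. */
    static ConfigurableFilter newInstance(Class<? extends ConfigurableFilter> type,
                                          Map<String, String> conf) {
        try {
            ConfigurableFilter filter = type.getDeclaredConstructor().newInstance();
            filter.setConf(conf); // configuration is injected after construction
            return filter;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot instantiate " + type.getName(), e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("date", "2013-01-15");
        ConfigurableFilter filter = newInstance(DateSuffixFilter.class, conf);
        System.out.println(filter.accept("/logs/2013-01-15"));
        System.out.println(filter.accept("/logs/2013-01-16"));
    }
}
```

A class whose only constructor takes a String, like the earlier RegexExcludePathFilter, would fail at the getDeclaredConstructor() step in this sketch, which is why the constructor has to go.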