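I have many files of different types: *.doc, *.pdf, and so on. I want to process them with MapReduce.

I put them into HDFS and then launch a Java MapReduce program from Hue.

If the file names are well-formed and contain no brackets such as "(){}[]", everything works.

But with a file named OPN_last_[age.PDF, I get this error: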

    Failing Oozie Launcher, Main class [distr.fors.ru.Index], main() threw exception, Illegal file pattern: Unclosed character class near index 17
    OPN_last_[age.PDF
    ^
    java.io.IOException: Illegal file pattern: Unclosed character class near index 17
    OPN_last_[age.PDF
    ^
    at org.apache.hadoop.fs.GlobFilter.init(GlobFilter.java:70)
    at org.apache.hadoop.fs.GlobFilter.<init>(GlobFilter.java:49)
    at org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:1670)
    at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1627)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:211)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:248)
    at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:1063)
    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1080)
    at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:992)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:945)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:945)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:566)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:596)
    at distr.fors.ru.Index.run(Index.java:78)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at distr.fors.ru.Index.main(Index.java:39)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:495)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:417)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)
    Caused by: java.util.regex.PatternSyntaxException: Unclosed character class near index 17
    OPN_last_[age.PDF
    ^
    at org.apache.hadoop.fs.GlobPattern.error(GlobPattern.java:167)
    at org.apache.hadoop.fs.GlobPattern.set(GlobPattern.java:151)
    at org.apache.hadoop.fs.GlobPattern.<init>(GlobPattern.java:42)
    at org.apache.hadoop.fs.GlobFilter.init(GlobFilter.java:66)
    ... 32 more

With a file named {2011-01-27} (3769330).pdf, I get this error instead:

    Input Pattern hdfs://fd-bigdata.distr.fors.ru:8020/{2011-01-27} (3769330).pdf matches 0 files 
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:231) 
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:248)
    at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:1063) 
    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1080) 
    at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174) 
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:992) 
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:945) 
    at java.security.AccessController.doPrivileged(Native Method) 
    at javax.security.auth.Subject.doAs(Subject.java:396) 
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408) 
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:945) 
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:566) 
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:596) 
    at distr.fors.ru.Index.run(Index.java:76) 
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) 
    at distr.fors.ru.Index.main(Index.java:37) 
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) 
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) 
    at java.lang.reflect.Method.invoke(Method.java:597) 
    at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:495) 
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) 
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:417) 
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332) 
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268) 
    at java.security.AccessController.doPrivileged(Native Method) 
    at javax.security.auth.Subject.doAs(Subject.java:396) 
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408) 
    at org.apache.hadoop.mapred.Child.main(Child.java:262)

I really need to process files like these. What can I do to fix these errors?

P.S. I am using the latest CDH, 4.4.0.

1 Answer

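To handle these special characters in Java, escape them with a double backslash '\\'. In a Java string literal, "\\" is a single backslash character, so the path pattern ends up with one backslash in front of the bracket: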

'[' => '\\['
'}' => '\\}' 

This works for me in Java, Pig, and Oozie. I hope it solves your problem too.
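
As a rough illustration of the escaping described above, here is a minimal sketch in plain Java. The class name `GlobEscape`, the method `escapeGlob`, and the exact set of characters treated as glob metacharacters are assumptions made for this example, not part of Hadoop. It also assumes the Hadoop version in use honors backslash escapes in path patterns, as reported above.

```java
public class GlobEscape {
    // Assumed set of glob metacharacters: backslash, *, ?, [, ], {, }.
    // Adjust it if your Hadoop version treats other characters specially.
    private static final String GLOB_META = "\\*?[]{}";

    /** Puts a backslash in front of every assumed glob metacharacter in a file name. */
    public static String escapeGlob(String name) {
        StringBuilder sb = new StringBuilder(name.length() + 8);
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (GLOB_META.indexOf(c) >= 0) {
                sb.append('\\'); // a single backslash character
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two file names from the question.
        System.out.println(escapeGlob("OPN_last_[age.PDF"));
        System.out.println(escapeGlob("{2011-01-27} (3769330).pdf"));
    }
}
```

The escaped name would then be used when building the input path, for example with `FileInputFormat.addInputPath(job, new Path(dir, escaped))`, where `job`, `dir`, and `escaped` are placeholders for your own values. Escape only the file name, not the whole URI, so the scheme and host part stay untouched.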

answered 2013-10-15T10:35:21.597