So, using something like this:
for (int i = 0; i < files.length; i++) {
    if (!files[i].isDirectory() && files[i].canRead()) {
        try {
            Scanner scan = new Scanner(files[i]);
            System.out.println("Generating Categories for " + files[i].toPath());
            while (scan.hasNextLine()) {
                count++;
                String line = scan.nextLine();
                System.out.println(" ->" + line);
                // Each line is "key<TAB>json"; keep only the JSON part
                line = line.split("\t", 2)[1];
                System.out.println("!- " + line);
                JsonParser parser = new JsonParser();
                JsonObject object = parser.parse(line).getAsJsonObject();
                Set<Entry<String, JsonElement>> entrySet = object.entrySet();
                exploreSet(entrySet);
            }
            scan.close();
            // System.out.println(keyset);
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
    }
}
When iterating over the Hadoop output files, a JSON object in the middle of one of them breaks the parse... because scan.nextLine() is not fetching the entire line before the split. That is, the output is:
->0 {"Flags":"0","transactions":{"totalTransactionAmount":"0","totalQuantitySold":"0"},"listingStatus":"NULL","conditionRollupId":"0","photoDisplayType":"0","title":"NULL","quantityAvailable":"0","viewItemCount":"0","visitCount":"0","itemCountryId":"0","itemAspects":{ ... "sellerSiteId":"0","siteId":"0","pictureUrl":"http://somewhere.com/45/x/AlphaNumeric/$(KGrHqR,!rgF!6n5wJSTBQO-G4k(Ww~~
!- {"Flags":"0","transactions":{"totalTransactionAmount":"0","totalQuantitySold":"0"},"listingStatus":"NULL","conditionRollupId":"0","photoDisplayType":"0","title":"NULL","quantityAvailable":"0","viewItemCount":"0","visitCount":"0","itemCountryId":"0","itemAspects":{ ... "sellerSiteId":"0","siteId":"0","pictureUrl":"http://somewhere.com/45/x/AlphaNumeric/$(KGrHqR,!rgF!6n5wJSTBQO-G4k(Ww~~
Most of the data above has been sanitized (but not, for the most part, the URLs...),
and in the file the URL continues as: $(KGrHqZHJCgFBsO4dC3MBQdC2)Y4Tg~~60_1.JPG?set_id=8800005007 ...
So it's rather irritating.
This is also entry #112, and I've parsed the other files without any errors... but this one has me stumped, mainly because I don't see how scan.nextLine() can behave this way...
Judging by the debug output, the JSON error is caused by the string not being split correctly.
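For what it's worth, one thing that can produce exactly this symptom (I can't verify it against the sanitized data above, so treat it as a guess) is that Scanner.nextLine() also treats the Unicode separators \u0085, \u2028 and \u2029 as line terminators, while BufferedReader.readLine() only breaks on \n, \r and \r\n. A minimal sketch with a made-up input line:

import java.io.BufferedReader;
import java.io.StringReader;
import java.util.Scanner;

public class NextLineDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical input: one logical line with a stray U+0085 (NEL)
        // embedded in the URL part, terminated by a normal \n.
        String data = "0\t{\"pictureUrl\":\"http://x/$(KGrHqR\u0085,!rgF...\"}\n";

        Scanner scan = new Scanner(new StringReader(data));
        // Scanner stops at the NEL, so the "line" is truncated mid-URL:
        System.out.println("Scanner:        " + scan.nextLine());

        BufferedReader reader = new BufferedReader(new StringReader(data));
        // BufferedReader reads through the NEL and returns the whole line:
        System.out.println("BufferedReader: " + reader.readLine());
    }
}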
Almost forgot: if I put the offending line in a file of its own and parse that, it works fine.
Edit: it also blows up at roughly the same spot if I delete the offending line.
Tried with both JVM 1.6 and 1.7.
Workaround: BufferedReader scan = new BufferedReader(new FileReader(files[i])); instead of a Scanner...
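Spelled out, the working loop is roughly the following; it assumes the same Gson imports and surrounding for/try as the snippet at the top, and the length check on the split is my own defensive addition, not something the original code had:

BufferedReader reader = new BufferedReader(new FileReader(files[i]));
String line;
while ((line = reader.readLine()) != null) {
    count++;
    String[] parts = line.split("\t", 2);
    if (parts.length < 2) continue; // no key<TAB>json pair on this line, skip it
    JsonObject object = new JsonParser().parse(parts[1]).getAsJsonObject();
    exploreSet(object.entrySet());
}
reader.close();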