I wrote a simple program to collect some statistics about the bigrams in some data. I print the statistics to a custom file.

Path file = new Path(context.getConfiguration().get("mapred.output.dir") + "/bigram.txt");
FSDataOutputStream out = file.getFileSystem(context.getConfiguration()).create(file);

My code has the following lines:

Text.writeString(out, "total number of unique bigrams: " + uniqBigramCount + "\n");
Text.writeString(out, "total number of bigrams: " + totalBigramCount + "\n");
Text.writeString(out, "number of bigrams that appear only once: " + onceBigramCount + "\n");

I get the following output in vim/gedit:

'total number of unique bigrams: 424462
!total number of bigrams: 1578220
0number of bigrams that appear only once: 296139

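Besides the unwanted characters at the start of each line, there are also some non-printing characters. What could be the reason behind this?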

1 Answer

As @ThomasJungblut said, the writeString method writes two values on every call: the length of the string (as a vint) and then the string's bytes:

/** Write a UTF8 encoded string to out
 */
public static int writeString(DataOutput out, String s) throws IOException {
  ByteBuffer bytes = encode(s);
  int length = bytes.limit();
  WritableUtils.writeVInt(out, length);
  out.write(bytes.array(), 0, length);
  return length;
}
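
That length prefix accounts for the stray characters in the question. The three strings encode to 39, 33 and 48 bytes, and those byte values are the ASCII codes for ', ! and 0. The sketch below reproduces this without Hadoop on the classpath. The class LengthPrefixDemo and its methods are hypothetical names made up for illustration, and the code assumes that a vint for a length of at most 127 is a single byte holding the length itself, which is how WritableUtils.writeVInt handles small values.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for Text.writeString, limited to short strings.
class LengthPrefixDemo {

    // Mimics the byte layout of Text.writeString for strings of at most
    // 127 encoded bytes: one length byte, then the UTF-8 bytes.
    // Assumption: longer strings need a multi-byte vint, not modelled here.
    static byte[] writeStringLike(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        if (utf8.length > 127) {
            throw new IllegalArgumentException("sketch only covers single-byte lengths");
        }
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        buf.write(utf8.length);          // the length prefix
        buf.write(utf8, 0, utf8.length); // the string bytes
        return buf.toByteArray();
    }

    // The character a text editor would show for the length byte.
    static char prefixChar(String s) {
        return (char) writeStringLike(s)[0];
    }

    public static void main(String[] args) {
        String[] lines = {
            "total number of unique bigrams: 424462\n",
            "total number of bigrams: 1578220\n",
            "number of bigrams that appear only once: 296139\n"
        };
        for (String line : lines) {
            char c = prefixChar(line);
            System.out.printf("length byte %d is displayed as %c%n", (int) c, c);
        }
    }
}
```

Running main with the strings from the question should report the same three characters that appear at the start of the lines in the editor output.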

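If you just want to print text output to that file (that is, everything human-readable), then I suggest wrapping the out variable in a PrintStream and using its println or printf methods: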

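// Requires: import java.io.PrintStream;
// PrintStream writes plain characters, with no length prefix.
PrintStream ps = new PrintStream(out);
ps.printf("total number of unique bigrams: %d\n", uniqBigramCount);
ps.printf("total number of bigrams: %d\n", totalBigramCount);
ps.printf("number of bigrams that appear only once: %d\n", onceBigramCount);
// Closing the PrintStream also closes the underlying out stream.
ps.close();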
answered 2012-07-25T10:35:15.293