java - Hadoop mapper reading multiple lines

Question

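New to Hadoop. I am trying to read my HDFS file in chunks, for example 100 lines at a time, and then run a regression on the data in the mapper using Apache's OLSMultipleLinearRegression. I am reading multiple lines with the code shown here: http://bigdatacircus.com/2012/08/01/wordcount-with-custom-record-reader-of-textinputformat/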

My mapper is defined as:

public void map(LongWritable key, Text value, Context context) throws java.io.IOException, InterruptedException
{
    String lines = value.toString();
    String[] lineArr = lines.split("\n");
    int lcount = lineArr.length;
    System.out.println(lcount); // prints out "1"
    context.write(new Text(Integer.toString(lcount)), new IntWritable(1));
}

My question is: why does System.out.println print lcount == 1? My file is delimited by "\n", and I have set NLINESTOPROCESS = 3 in the record reader. My input file has the following format:

y x1 x2 x3 x4 x5
y x1 x2 x3 x4 x5
y x1 x2 x3 x4 x5
...

If I read only one line at a time, I cannot run a multiple regression, because the regression API needs multiple data points... Thanks for your help.
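
To make the goal concrete, here is a minimal sketch of how a multi-line chunk in the "y x1 x2 ..." format above might be turned into the arrays a regression needs. The class name ChunkParser, the Sample holder, and the sample numbers are hypothetical and only for illustration. It assumes the chunk is non-empty, that rows are separated by line feeds, and that fields are separated by whitespace. The Commons Math calls are left in comments so the sketch runs without the library on the classpath.

```java
public class ChunkParser {
    /** Holds the response vector y and the predictor matrix x parsed from one chunk. */
    static final class Sample {
        final double[] y;
        final double[][] x;

        Sample(double[] y, double[][] x) {
            this.y = y;
            this.x = x;
        }
    }

    // Parses a line-feed-separated chunk of whitespace-delimited "y x1 x2 ..." rows.
    // Assumes a non-empty chunk with no blank rows in the middle.
    static Sample parse(String chunk) {
        String[] rows = chunk.trim().split("\\n");
        double[] y = new double[rows.length];
        double[][] x = new double[rows.length][];
        for (int i = 0; i < rows.length; i++) {
            String[] fields = rows[i].trim().split("\\s+");
            y[i] = Double.parseDouble(fields[0]);        // first column is the response
            x[i] = new double[fields.length - 1];        // remaining columns are predictors
            for (int j = 1; j < fields.length; j++) {
                x[i][j - 1] = Double.parseDouble(fields[j]);
            }
        }
        return new Sample(y, x);
    }

    public static void main(String[] args) {
        // Made-up numbers in the same layout as the input file.
        Sample s = parse("1.0 2 3 4 5 6\n2.0 3 4 5 6 7\n3.0 4 5 6 7 8\n");
        System.out.println(s.y.length + " rows, " + s.x[0].length + " predictors");
        // With Apache Commons Math available, the arrays could then be handed over:
        // OLSMultipleLinearRegression ols = new OLSMultipleLinearRegression();
        // ols.newSampleData(s.y, s.x);
        // double[] beta = ols.estimateRegressionParameters();
    }
}
```

For the fit to be determined, a chunk would need more rows than there are parameters to estimate, so three rows with five predictors would not be enough.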

score 0 · Accepted Answer

String.split() takes a regular expression as its argument. You have to double-escape the backslash.

String []lineArr = lines.split("\\n");
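
A small standalone check, outside Hadoop, can show how the split behaves on a multi-line string. SplitCheck and countLines are hypothetical names used only for this sketch. In this check the single-escaped "\n" literal splits the same way as the double-escaped one, since the regex engine treats a raw line feed as a literal. If the count in the mapper is still 1, it may be worth verifying that the Text value produced by the record reader actually contains line feeds between the lines it concatenates.

```java
public class SplitCheck {
    // Counts the lines in a chunk using the double-escaped pattern.
    static int countLines(String chunk) {
        return chunk.split("\\n").length;
    }

    public static void main(String[] args) {
        String threeLines = "y x1 x2\ny x1 x2\ny x1 x2";
        String noLineFeeds = "y x1 x2 y x1 x2 y x1 x2";

        System.out.println(countLines(threeLines));        // 3
        System.out.println(countLines(noLineFeeds));       // 1: nothing to split on

        // The single-escaped literal reaches the regex engine as a raw line feed
        // and splits the same way on this input.
        System.out.println(threeLines.split("\n").length); // 3
    }
}
```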
