
I have a 128 MB file, so it is split into 2 blocks (block size = 64 MB). I am trying to convert a fixed-length file to a delimited ASCII file using a custom RecordReader class.

Problem:

When the first split of the file is processed, I get the records properly; viewing the data through a Hive table on top of it, I can see the reader even reaches into data node 2 to fetch the characters up to the end of the last record. But the second split starts with a \n character, and the number of records gets doubled.

Ex:
First split:  456   2348324534   34953489543   349583534
Second split: 456   23
              48324534   34953489543   349583534

To skip the characters already consumed by the first input split, the record reader contains the following code:

FixedAsciiRecordReader(FileSplit genericSplit, JobConf job) throws IOException {
    if ((start % recordByteLength) > 0) {
        // Split starts mid-record: advance to the next record boundary.
        pos = start - (start % recordByteLength) + recordByteLength;
    } else {
        pos = start;
    }

    fileIn.skip(pos);
}

The Input Fixed Length file has a \n character at the end of each record.

Should any value be set for the start variable as well?
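The alignment arithmetic in the constructor above can be checked in isolation. A minimal sketch (class name and the 64 MB split offset are hypothetical; recordByteLength is assumed to include the trailing \n):

```java
// Demonstrates the split-alignment arithmetic from the question:
// given a split's start offset, advance to the next record boundary.
public class SplitAlignDemo {
    static long alignToRecord(long start, long recordByteLength) {
        long rem = start % recordByteLength;
        // If the split begins mid-record, skip ahead to the next boundary;
        // otherwise the split already starts on a record boundary.
        return rem > 0 ? start - rem + recordByteLength : start;
    }

    public static void main(String[] args) {
        long recordLen = 100; // e.g. 99 data bytes + 1 byte for '\n' (hypothetical)
        System.out.println(alignToRecord(0, recordLen));        // 0: first split, nothing to skip
        System.out.println(alignToRecord(67108864, recordLen)); // 67108900: next boundary after 64 MB
    }
}
```

Note this only lands on real record boundaries if the records start at offset 0 of the file, which turns out to be the crux of the problem below.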


1 Answer


I found the solution to this problem. My input fixed-length file has a variable-length header, which was not being skipped, so the position did not begin at the start of a record but at (StartofRecord - HeaderLength). As a result, every record read a few characters (as many as the header length) from the previous record.

Updated code:

if ((start % recordByteLength) > 0) {
    pos = start - (start % recordByteLength) + recordByteLength + headerLength;
} else {
    pos = start;
}

fileIn.skip(pos);
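One way to sanity-check this header-aware offset math offline: with a header of headerLength bytes, record k begins at headerLength + k * recordByteLength, so the first record boundary at or after a split start can be computed directly. A minimal sketch, not the answer's exact formula (class name and all sizes are hypothetical):

```java
// Computes the first record boundary at or after a given split start,
// for a file laid out as [header][record 0][record 1]...
public class HeaderOffsetDemo {
    static long nextBoundary(long start, long recordByteLength, long headerLength) {
        if (start <= headerLength) {
            return headerLength;             // split begins inside the header
        }
        long rel = start - headerLength;     // offset relative to the first record
        long rem = rel % recordByteLength;
        // Already on a boundary, or round up to the next one.
        return rem == 0 ? start : start + (recordByteLength - rem);
    }

    public static void main(String[] args) {
        long recordLen = 100, header = 30;   // hypothetical sizes
        System.out.println(nextBoundary(0, recordLen, header));   // 30
        System.out.println(nextBoundary(150, recordLen, header)); // 230
        System.out.println(nextBoundary(230, recordLen, header)); // 230 (on a boundary)
    }
}
```

Comparing its output against the pos your updated code produces for a few sample start offsets is a quick way to confirm that every split lands exactly on a record boundary for your particular header and record lengths.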
Answered 2014-05-15T16:26:01.437