I'm writing a custom Loader for Pig. It is supposed to read delimited records that may span multiple lines. Everything works, except that sometimes an input split falls in the middle of a record and breaks the parsing. I know that RecordReader and InputFormat control where files are split, but I can't figure out how to make it work in my case. It looks to me like CSVExcelStorage should have the same problem, but I can't find any code in it that handles this.

1 Answer

CSVExcelStorage assumes there are no embedded newlines, so it has no code to handle them.

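You're right that the RecordReader is the culprit here. You need to write a new record reader class that understands your data, so that it knows which newlines are candidates for split points and which are just part of the data. Once you've written the new record reader class, you'll need a new InputFormat type that uses that record reader class.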
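To illustrate the "which newlines are part of the data" decision, here is a minimal sketch in plain Java with no Hadoop or Pig dependency. It assumes, purely for illustration, that embedded newlines only occur inside double-quoted fields and that a literal quote is escaped as `""`; your format may differ. The class name `QuotedRecordSplitter` and its methods are hypothetical, not part of any library. A custom RecordReader could apply the same quote-tracking logic when assembling each value.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of Pig or Hadoop. */
class QuotedRecordSplitter {

    /**
     * Reads one logical record: everything up to the first newline that is
     * NOT inside double quotes. Returns null when the input is exhausted.
     */
    static String nextRecord(Reader in) throws IOException {
        StringBuilder record = new StringBuilder();
        boolean inQuotes = false;
        boolean sawAnything = false;
        int c;
        while ((c = in.read()) != -1) {
            sawAnything = true;
            if (c == '"') {
                // An escaped quote ("") toggles twice, leaving the state unchanged.
                inQuotes = !inQuotes;
            }
            if (c == '\n' && !inQuotes) {
                return record.toString(); // real record boundary
            }
            record.append((char) c); // a newline inside quotes is kept as data
        }
        // Last record may lack a trailing newline.
        return sawAnything ? record.toString() : null;
    }

    /** Convenience wrapper: splits an in-memory string into logical records. */
    static List<String> splitRecords(String data) {
        List<String> records = new ArrayList<>();
        try (Reader in = new StringReader(data)) {
            String record;
            while ((record = nextRecord(in)) != null) {
                records.add(record);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }
}
```

Note that this only works when reading starts at a known record boundary. A reader dropped into the middle of a file cannot tell whether it is inside a quoted field, which is exactly the split problem from the question. One simple way around it, at the cost of parallelism within a file, is to have your InputFormat extend `FileInputFormat` and override `isSplitable` to return `false`, so that each file is read from the beginning by a single reader.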

answered 2012-10-01T19:15:51.650