I can't load Microsoft Word documents (.doc or .docx) with Pig. Whether I use TextLoader(), PigStorage(), or no loader at all, it doesn't work: the output is just garbled symbols.

I heard that I could write a custom loader in Java, but it seems really difficult and I don't understand how to write one at the moment.

I would like to put the entire content of a .doc file into a single chararray bag so that I can later process it with a filter function.

How can I do this?

Thanks

1 Answer

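They are right. Because .doc and .docx are binary formats, simple text loaders won't work. You can either write a UDF that loads the files directly into Pig, or do some preprocessing to convert all the .doc and .docx files to .txt files and have Pig load those .txt files instead. This link can help you get started on finding a way to convert the files.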
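To give an idea of what the preprocessing route could look like, here is a minimal sketch in plain Java (standard library only). It relies on the fact that a .docx file is a ZIP archive whose body text normally lives in `word/document.xml`. The class name `DocxToText` and its methods are hypothetical names made up for this illustration. It does not handle the legacy binary .doc format; for that you would need a library such as Apache POI or Apache Tika.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper: converts a .docx file to plain text that Pig can load. */
final class DocxToText {

    // Namespace of WordprocessingML elements in ordinary (transitional) .docx files.
    static final String WORD_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /**
     * Concatenates the text runs (w:t) of a document body and ends each
     * paragraph (w:p) with a newline. Tabs, line breaks, headers and
     * footnotes are ignored in this sketch.
     */
    static String textFromDocumentXml(InputStream xml) {
        StringBuilder out = new StringBuilder();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Word files do not need DTDs, so switch DTD processing off.
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
            XMLStreamReader reader = factory.createXMLStreamReader(xml);
            boolean inText = false;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    inText = WORD_NS.equals(reader.getNamespaceURI())
                            && "t".equals(reader.getLocalName());
                } else if (event == XMLStreamConstants.CHARACTERS && inText) {
                    out.append(reader.getText());
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    if (WORD_NS.equals(reader.getNamespaceURI())
                            && "p".equals(reader.getLocalName())) {
                        out.append('\n');
                    }
                    inText = false;
                }
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("document.xml is not well-formed XML", e);
        }
        return out.toString();
    }

    /** Assumes the main body part is stored at word/document.xml, as Word writes it. */
    static String textFromDocx(Path docx) throws IOException {
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            ZipEntry body = zip.getEntry("word/document.xml");
            if (body == null) {
                throw new IOException("no word/document.xml in " + docx);
            }
            try (InputStream in = zip.getInputStream(body)) {
                return textFromDocumentXml(in);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java DocxToText.java input.docx output.txt
        Path in = Paths.get(args[0]);
        Path out = Paths.get(args[1]);
        Files.write(out, textFromDocx(in).getBytes(StandardCharsets.UTF_8));
    }
}
```

Once the .txt files are written, Pig's TextLoader() should be able to read them line by line. Note that this gives you one record per paragraph rather than one record per document.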

However, I would still recommend learning to write UDFs. Preprocessing the files adds a lot of overhead that could be avoided.

UPDATE: Here are some resources I have used in the past for writing Java (Load) UDFs.

answered 2013-08-29T17:01:33.653