hadoop - Pig - load Word documents (.doc & .docx) with pig

Question

I can't load Microsoft Word documents (.doc or .docx) with Pig. When I try, whether with TextLoader(), PigStorage() or no loader at all, it doesn't work: the output is just some weird symbols.

I heard that I could write a custom loader in Java, but it seems really difficult and I don't understand how to program one at the moment.

I would like to put all the content of a .doc file into a single chararray bag so that I can later use a filter function to process it.

How can I do this?

Thanks

score 1 · Accepted Answer

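They are right. Since .doc and .docx are binary formats, simple text loaders will not work. You can either write a UDF so that you can load the files directly into Pig, or you can do some preprocessing to convert all the .doc and .docx files into .txt files, so that Pig loads those .txt files instead. This link can help you get started on finding a way to convert the files.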
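As a rough illustration of the preprocessing route, here is a minimal sketch in Python that pulls the plain text out of a .docx file using only the standard library. It relies on the fact that a .docx file is a ZIP archive whose main body lives in `word/document.xml`. The function names `docx_to_text` and `convert_file` and the `separator` parameter are made up for this example, and it does not handle the older binary .doc format, which would need a separate conversion tool.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(path, separator="\n"):
    """Return the text of a .docx file, one entry per paragraph.

    Hypothetical helper for illustration: it ignores headers, footers,
    footnotes and formatting, and may repeat text that sits in nested
    paragraphs (for example inside text boxes).
    """
    # A .docx file is a ZIP archive; the body is stored as XML.
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    # <w:p> is a paragraph; <w:t> holds the text of each run inside it.
    for para in root.iter(f"{{{W_NS}}}p"):
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return separator.join(paragraphs)


def convert_file(src_path, dst_path, separator="\n"):
    """Write the extracted text of src_path to dst_path as UTF-8."""
    text = docx_to_text(src_path, separator)
    with open(dst_path, "w", encoding="utf-8") as out:
        out.write(text)
```

Since TextLoader() produces one record per line, calling `convert_file("report.docx", "report.txt", separator=" ")` (file names are placeholders) should leave the whole document on a single line, which is probably closer to the "single chararray" the question asks for.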

However, I would still recommend learning to write UDFs. Preprocessing the files adds a lot of overhead that could be avoided.

Update: here are some resources I have used in the past to write Java (Load) UDFs: one, two.
