hadoop - Converting Word documents to PDF with Hadoop

Question

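Suppose I want to convert thousands of Word files to PDF. Does it make sense to use Hadoop for this? Does Hadoop offer any advantage over simply using multiple EC2 instances with a job queue?

Also, if there is 1 file and 10 free nodes, will Hadoop split the file and send it to all 10 nodes, or will the file go to just 1 node while the other 9 sit idle?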

score 2 · Accepted Answer

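There is not much advantage to using Hadoop for this use case. Having competing consumers read from a queue and produce the output would be easier to set up and would probably be more efficient.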
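A minimal sketch of the competing-consumers idea, assuming a single process with worker threads and an in-memory queue. On EC2 the queue would more likely be an external service and each consumer a separate instance. The names `run_workers` and `convert_with_libreoffice` are hypothetical, and the conversion call assumes LibreOffice is installed with `soffice` on the PATH.

```python
import queue
import subprocess
import threading
from pathlib import Path


def convert_with_libreoffice(src, out_dir="pdf_out"):
    """Convert one document to PDF with headless LibreOffice (assumed installed)."""
    subprocess.run(
        ["soffice", "--headless", "--convert-to", "pdf", "--outdir", out_dir, src],
        check=True,
    )
    return str(Path(out_dir) / (Path(src).stem + ".pdf"))


def run_workers(paths, convert, num_workers=4):
    """Let num_workers threads compete for jobs on one shared queue."""
    jobs = queue.Queue()
    for p in paths:
        jobs.put(p)

    results, errors = {}, {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                src = jobs.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker is done
            try:
                out = convert(src)
                with lock:
                    results[src] = out
            except Exception as exc:  # record the failure, keep consuming
                with lock:
                    errors[src] = exc

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, errors
```

Calling `run_workers(list_of_docx_paths, convert_with_libreoffice)` would fan the files out across the workers. Running several LibreOffice instances at once may require a separate user profile per worker, which this sketch does not handle.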

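Hadoop will not automatically split a document and process the parts on different nodes. If you have a very large document (thousands of pages long), Hadoop could make sense, but only if the time needed to generate the PDF on a single machine is significant.

Each map task could render a few thousand pages, and the reduce task would merge the PDFs into a single document, although if the resulting file is very large, it may be hard to read.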
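The map/reduce split described above can be sketched in plain Python, without Hadoop. This assumes the document can be rendered one page range at a time, which the answer takes for granted. All function names are hypothetical, and `render_pages` and `merge` stand in for whatever rendering and PDF-merging tools are actually used.

```python
def split_pages(total_pages, pages_per_chunk):
    """Return (first_page, last_page) ranges, 1-based and inclusive."""
    return [
        (start, min(start + pages_per_chunk - 1, total_pages))
        for start in range(1, total_pages + 1, pages_per_chunk)
    ]


def map_render(page_range, render_pages):
    """Map step: render one page range, keyed by its first page for ordering."""
    first, last = page_range
    return first, render_pages(first, last)


def reduce_merge(rendered_chunks, merge):
    """Reduce step: put the chunks back in page order and merge them."""
    ordered = [chunk for _, chunk in sorted(rendered_chunks, key=lambda kv: kv[0])]
    return merge(ordered)
```

Keying each chunk by its first page matters because map outputs may arrive in any order, while the merged PDF must preserve page order.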

score 1 · Accepted Answer

Say if I want to convert 1000s of word files to pdf then would using Hadoop to approach this problem make sense? Would using Hadoop have any advantage over simply using multiple EC2 instances with job queues?

I think either tool could accomplish this task, so it depends on what you plan to do with the documents after conversion. Derek Gottfrid at the New York Times famously found Hadoop to be a useful tool for large-scale document conversion, so it's certainly within the realm of tasks at which Hadoop performs well.

Also if there was 1 file and 10 free nodes then would hadoop split the file and send it to the 10 nodes or will the file be sent to just 1 node while 9 sit idle?

It depends on the InputFormat you use. As you can see in the documentation, you can specify how to compute the "InputSplits", which might include splitting a large document into chunks.
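To make the effect of splitting concrete, here is a simplified illustration of how a file might be divided into splits, with each split typically handled by one map task. This is not Hadoop's API or its exact algorithm; `compute_splits` and its parameters are hypothetical, and byte-range splitting is only meaningful for formats that can be read from an arbitrary offset, which a Word file generally cannot without a custom InputFormat.

```python
def compute_splits(file_length, split_size, splittable=True):
    """Return (offset, length) pairs covering a file of file_length bytes."""
    if not splittable or file_length <= split_size:
        # One split means one task: the other nodes get nothing from this file.
        return [(0, file_length)]
    return [
        (offset, min(split_size, file_length - offset))
        for offset in range(0, file_length, split_size)
    ]
```

With a non-splittable format, the single file yields one split, so one node does the work while the other nine sit idle. With a splittable format and a small enough split size, the same file can yield ten splits.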

Good luck with whatever tool you choose for this problem!

Regards, Jeff

score 0 · Accepted Answer

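How many thousands are you talking about? If this is a one-off batch, I would set it up on a single machine and let it run. I think you would be surprised how quickly thousands of documents can be converted to PDF. Even if the job has to run for a few days, a one-time conversion does not need the complexity of something like Hadoop. If you are continuously converting thousands of documents, then it may be worth the effort to set up something else.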
