out-of-memory - Converting Docx to image using Docx4j and PdfBox causes OutOfMemoryError

Question

I'm converting the first page of a docx file to an image in two steps using docx4j and pdfbox, but I currently get an OutOfMemoryError every time.

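I've been able to determine that the exception is thrown in the final step of the process, while the convertToImage method is being called. However, I've been using that second step to convert PDFs for some time without any problems, so I'm at a loss as to what the cause might be, unless docx4j is encoding the PDF in a way I haven't yet tested or is producing a corrupt file.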

I've tried replacing the ByteArrayOutputStream with a FileOutputStream, and the PDF appears to render correctly and is no larger than I would expect.

This is the code I'm using:

WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(file);
org.docx4j.convert.out.pdf.PdfConversion c = new org.docx4j.convert.out.pdf.viaXSLFO.Conversion(wordMLPackage);

((org.docx4j.convert.out.pdf.viaXSLFO.Conversion)c).setSaveFO(File.createTempFile("fonts", ".fo"));
ByteArrayOutputStream os = new ByteArrayOutputStream();
c.output(os, new PdfSettings());

byte[] bytes = os.toByteArray();
os.close();

ByteArrayInputStream is = new ByteArrayInputStream(bytes);

PDDocument document = PDDocument.load(is);

PDPage page = (PDPage) document.getDocumentCatalog().getAllPages().get(0);
BufferedImage image = page.convertToImage(BufferedImage.TYPE_INT_RGB, 96);

is.close();
document.close();
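As a rough sanity check on the failing convertToImage call, the sketch below estimates how much memory the rendered raster itself should need. It assumes the page size is known in PDF points (1/72 inch) and that a TYPE_INT_RGB image stores one 4-byte int per pixel. The class and method names are made up for illustration, and the exact rounding PdfBox applies may differ.

```java
// Rough estimate of the memory needed by a rendered page raster.
// Assumptions: page size is given in PDF points (1/72 inch) and the
// image stores one 4-byte int per pixel (as TYPE_INT_RGB does).
final class RasterEstimate {
    static final double POINTS_PER_INCH = 72.0;
    static final int BYTES_PER_PIXEL = 4;

    // Converts a length in points to pixels at the given resolution.
    static int pixels(double points, int dpi) {
        return (int) Math.round(points * dpi / POINTS_PER_INCH);
    }

    // Approximate size of the pixel data for one page, in bytes.
    static long rasterBytes(double widthPt, double heightPt, int dpi) {
        return (long) pixels(widthPt, dpi) * pixels(heightPt, dpi) * BYTES_PER_PIXEL;
    }

    public static void main(String[] args) {
        // US Letter is 612 x 792 points; 96 is the resolution used in the question.
        System.out.println(rasterBytes(612, 792, 96) + " bytes");
    }
}
```

For a US Letter page at 96 DPI this comes to a few megabytes, so if the numbers for the real document are similar, the image buffer alone is unlikely to exhaust the heap and the pressure probably comes from elsewhere.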

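Edit: To provide some more context, this code runs inside a Grails web application. I've tried several variations of it, including nulling out everything that is no longer needed and using FileInputStream and FileOutputStream to try to save more memory. I've also checked the output of both docx4j and pdfbox, and both appear to work correctly.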
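For reference, the file-based variation mentioned above can be sketched with plain JDK streams. This is only an illustration of handing the intermediate PDF over through a temporary file instead of a byte[]; the StreamWriter and StreamReader interfaces and the handOff method are hypothetical names, and the real docx4j and pdfbox calls would go inside the two callbacks.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hands intermediate data from a producer to a consumer through a
// temporary file, so the whole payload is never held in a byte[].
final class TempFileHandoff {
    // Hypothetical callback that writes the intermediate PDF.
    interface StreamWriter { void writeTo(OutputStream out) throws IOException; }
    // Hypothetical callback that reads the intermediate PDF back.
    interface StreamReader<T> { T readFrom(InputStream in) throws IOException; }

    static <T> T handOff(StreamWriter writer, StreamReader<T> reader) {
        Path tmp = null;
        try {
            tmp = Files.createTempFile("converted", ".pdf");
            // Step 1: the producer writes to disk and the stream is closed.
            try (OutputStream out = Files.newOutputStream(tmp)) {
                writer.writeTo(out);
            }
            // Step 2: the consumer reads the same file back.
            try (InputStream in = Files.newInputStream(tmp)) {
                return reader.readFrom(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            // Always remove the temporary file, even on failure.
            if (tmp != null) {
                try { Files.deleteIfExists(tmp); } catch (IOException ignored) { }
            }
        }
    }
}
```

This only removes the extra in-memory copy of the PDF bytes. It does nothing about memory used inside either library while it runs.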

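I'm using docx4j 2.8.1 and pdfbox 0.7.3. I've also tried pdf-renderer, but I still get the OutOfMemoryError. My suspicion is that docx4j uses too much memory, but the error doesn't surface until the PDF-to-image conversion.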
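One way to test that suspicion is to log heap usage around each step. The sketch below uses only java.lang.Runtime. The HeapProbe class and its labels are hypothetical, and the figures are approximate because garbage that has not yet been collected still counts as used.

```java
// Minimal heap probe for comparing memory use before and after each step.
final class HeapProbe {
    // Heap currently in use, including garbage not yet collected.
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    // Upper limit the JVM will try to use for the heap (see -Xmx).
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    // One-line report suitable for a log statement.
    static String report(String label) {
        long mb = 1024L * 1024L;
        return String.format("%s: used=%d MB, max=%d MB",
                label, usedHeapBytes() / mb, maxHeapBytes() / mb);
    }

    public static void main(String[] args) {
        System.out.println(report("before docx to pdf"));
        // ... run the docx to PDF step here ...
        System.out.println(report("after docx to pdf"));
        // ... run the PDF to image step here ...
        System.out.println(report("after pdf to image"));
    }
}
```

If the used figure is already close to the maximum after the first step, that would point at the docx to PDF conversion rather than at convertToImage.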

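I would gladly accept an alternative way of converting docx files to PDF, or straight to an image, as an answer. I'm currently trying to replace jodconverter, which has been problematic to run on a server.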

score 3 · Accepted Answer

I'm part of the XDocReport team.

We recently developed a small web application, deployed on CloudBees (http://xdocreport-converter.opensagres.cloudbees.net/), that demonstrates how the converters behave.

You can easily compare the behaviour and performance of docx4j and XDocReport for PDF and HTML conversion.

The source code can be found here:

https://github.com/pascalleclercq/xdocreport-demo (REST-Service-Converter-WebApplication subfolder), and here: https://github.com/pascalleclercq/xdocreport/blob/master/remoting/fr.opensagres.xdocreport.remoting.converter.server/src/main/java/fr/opensagres/xdocreport/remoting/converter/server/ConverterResourceImpl.java

The first figures I got show XDocReport generating PDFs roughly 10 times faster than docx4j.

Feedback is welcome.

score 3 · Accepted Answer

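Brilliant success at last! I replaced docx4j with XDocReport and the document is converted to PDF instantly. There do seem to be some issues with certain documents, but I expect this is due to the operating system they were created on and that it can be resolved by using: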

PDFViaITextOptions options = PDFViaITextOptions.create().fontEncoding("windows-1250");

with the encoding appropriate to that operating system, rather than just:

PDFViaITextOptions options = PDFViaITextOptions.create();

which defaults to the current operating system.

This is the code I'm now using to convert from DOCX to PDF:

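// Load the DOCX with Apache POI. Named "docx" so it does not clash
// with the PDDocument variable declared further down.
FileInputStream in = new FileInputStream(file);
XWPFDocument docx = new XWPFDocument(in);
in.close();

PDFViaITextOptions options = PDFViaITextOptions.create();

// Convert the DOCX to PDF in memory.
ByteArrayOutputStream out = new ByteArrayOutputStream();
XWPF2PDFViaITextConverter.getInstance().convert(docx, out, options);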

byte[] bytes = out.toByteArray();
out.close();

ByteArrayInputStream is = new ByteArrayInputStream(bytes);
PDDocument document = PDDocument.load(is);

PDPage page = (PDPage) document.getDocumentCatalog().getAllPages().get(0);
BufferedImage image = page.convertToImage(BufferedImage.TYPE_INT_RGB, 96);

is.close();
document.close();

return image;
