java - How to properly convert DOCM to PDF using open source Java libraries?

Question

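I have started looking into converting .docm files to PDF. As far as I know, the open source libraries only convert .docx to PDF. My approach was therefore to find a way to convert .docm to .docx first while keeping all of the information. I could not find a ready-made open source solution for that step, but I did find a commit to apache-poi (link). Using the code from that commit, I managed to create .docx files that contain all of the information from my .docm files.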

    String dir = "<directory>";
    for (int i = 1; i < 41; i++) {
        File f = new File(dir + File.separator + i + ".docm");
        File target = new File(dir + "output" + i + ".docx");
        try {
            // DocumentConverter is the class copied from the linked apache-poi commit
            new DocumentConverter(f).toDocx(target);
        } catch (IOException e1) {
            e1.printStackTrace();
        }
    }

I copied the code from the link and used it as shown above.
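As a rough illustration of what a .docm to .docx step involves (this is not the code from the linked commit), the sketch below uses only the JDK. It copies every part of the package unchanged and swaps the macro-enabled main content type in `[Content_Types].xml` for the plain .docx one. The class name `DocmToDocxSketch` and its method are hypothetical, the content-types file is assumed to be UTF-8, and any VBA parts are left in place, so whether Word or a PDF converter accepts the result would still need to be checked.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of apache-poi, docx4j or documents4j.
class DocmToDocxSketch {
    // Content type of the main document part in a macro-enabled package (.docm).
    static final String DOCM_MAIN =
            "application/vnd.ms-word.document.macroEnabled.main+xml";
    // Content type of the main document part in a regular package (.docx).
    static final String DOCX_MAIN =
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml";

    static void convert(Path docm, Path docx) throws IOException {
        try (ZipInputStream in = new ZipInputStream(Files.newInputStream(docm));
             ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(docx))) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                out.putNextEntry(new ZipEntry(entry.getName()));
                if (!entry.isDirectory()) {
                    // readAllBytes (Java 9+) stops at the end of the current entry.
                    byte[] data = in.readAllBytes();
                    if ("[Content_Types].xml".equals(entry.getName())) {
                        // Assumption: the content-types part is UTF-8 encoded.
                        String xml = new String(data, StandardCharsets.UTF_8);
                        data = xml.replace(DOCM_MAIN, DOCX_MAIN)
                                  .getBytes(StandardCharsets.UTF_8);
                    }
                    out.write(data);
                }
                out.closeEntry();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 2) {
            convert(Paths.get(args[0]), Paths.get(args[1]));
        }
    }
}
```

It could be called as `DocmToDocxSketch.convert(Paths.get("1.docm"), Paths.get("output1.docx"))`. All other parts, including comments, are copied byte for byte, which is why no document content should be lost in this step.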

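Once I had .docx files containing all of the information, I moved on to converting them to .pdf files. For this I found two possible open source libraries, docx4j and documents4j.

Docx4j conversion to PDF: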

    // try-with-resources closes the output stream even if the conversion fails
    try (OutputStream os = new FileOutputStream(dir + "out" + i + ".pdf")) {
        Docx4J.toPDF(WordprocessingMLPackage.load(target), os);
    } catch (IOException | Docx4JException e1) {
        e1.printStackTrace();
    }

This gives me a PDF file that contains all of the information except the MS Word comments.

Documents4j conversion to PDF:

    try (ByteArrayOutputStream bo = new ByteArrayOutputStream()) {
        try (InputStream in = new BufferedInputStream(new FileInputStream(target))) {
            IConverter converter = LocalConverter.builder()
                    .baseFolder(new File(dir))
                    .workerPool(20, 25, 2, TimeUnit.SECONDS)
                    .processTimeout(5, TimeUnit.SECONDS)
                    .build();

            Future<Boolean> conversion = converter
                    .convert(in).as(DocumentType.DOC)
                    .to(bo).as(DocumentType.PDF)
                    .prioritizeWith(1000) // optional
                    .schedule();
            conversion.get();
            try (OutputStream outputStream = new FileOutputStream("out" + i + ".pdf")) {
                bo.writeTo(outputStream);
            }
            converter.shutDown();
        }
    } catch (IOException e) {
        e.printStackTrace();
    } catch (InterruptedException | ExecutionException e) {
        e.printStackTrace();
    }

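This gives me a good-looking PDF file that includes the MS Word comments.

Further testing showed that the docx4j PDFs are accurate in their text but shifted in layout (for example, paragraphs are merged or split in two). The PDFs from documents4j are more accurate in layout but, as I said, they are missing information. My tests were run on form documents that were all created the same way, and the missing information is always in the same place.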

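My questions are as follows:

Is there a proven way to correctly convert .docm files to .docx files using open source libraries?
What am I doing wrong when creating the PDF with documents4j?
How can I include the MS Word comments with docx4j?
Are there alternatives to the libraries I have chosen? (open source only)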

Edit: I forgot to mention that I am using the latest version of each library.

score 0 · Accepted Answer

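documents4j delegates the actual work to MS Word through a VBS script, so any difference in the result comes down to the configuration in that script. You can experiment with it and see whether you can get Word to include the content you are missing: https://github.com/documents4j/documents4j/blob/master/documents4j-transformer-msoffice/documents4j-transformer-msoffice-word/src/main/resources/word_convert.vbs

Just build the project and see how your changes affect the output.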
