
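When I try to convert a UTF-8 encoded Markdown file to PDF with Docverter (through its API), the non-ASCII characters are simply lost.

Is there a workaround?

I want to convert .md -> .pdf. Perhaps Docverter could handle .md -> .html, and I could then use some other library/service for .html -> .pdf?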


1 Answer


Update (14/10/2013)

The problem with boolean options in Docverter is now solved, so you can now do a direct conversion from md to pdf, passing the option ascii=true to Docverter. This causes the intermediate HTML to use entities instead of utf-8, and thus the resulting pdf is OK.
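
For illustration, the direct conversion could be requested roughly as below. This is only a sketch: the endpoint URL and the form field names from, to and input_files[] are my recollection of Docverter's form-based API and should be checked against its documentation; only ascii=true comes from the update above. The helper names docverter_fields and md_to_pdf are hypothetical, and the third-party requests package is assumed for the upload.

```python
# Assumed endpoint of the hosted Docverter service; verify before use.
DOCVERTER_URL = "http://c.docverter.com/convert"


def docverter_fields(ascii_entities=True):
    """Build the form fields for a markdown -> pdf conversion request."""
    fields = {"from": "markdown", "to": "pdf"}
    if ascii_entities:
        # ascii=true makes the intermediate HTML use entities instead of UTF-8.
        fields["ascii"] = "true"
    return fields


def md_to_pdf(md_path, pdf_path, url=DOCVERTER_URL):
    """Upload a Markdown file and save the returned PDF (needs network access)."""
    import requests  # third-party; imported here so docverter_fields works without it

    with open(md_path, "rb") as src:
        response = requests.post(
            url,
            data=docverter_fields(),
            files={"input_files[]": src},  # assumed field name for uploads
        )
    response.raise_for_status()
    with open(pdf_path, "wb") as out:
        out.write(response.content)
```

Calling md_to_pdf("input.md", "output.pdf") would then replace the three-step workaround described in the original answer, provided the service accepts these fields.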

Original answer

After a lot of research (I also had this same problem), I discovered that the bug is in the html->pdf conversion made by Docverter, which uses Flying Saucer libraries. This conversion ignores any non-ascii char in the HTML input, even if the charset is correctly set to utf-8 in the meta tags.

However, if the HTML contains entities such as &oacute; etc., then Flying Saucer does include those characters, and assuming a font which has the correct encoding (the default fonts used by the library are fine), the proper char (ó in this example) is shown in the resulting pdf.

So I ended up with the following approach:

  1. Use Docverter to convert .md -> html
  2. Process the resulting html to use HTML entities instead of utf-8
  3. Use Docverter again to convert .html -> .pdf

Step 2 is easy if you happen to use python. In this case, the following lines do the trick:

import io

def fixHTML(filename):
    # Read the file as UTF-8 into a unicode string
    with io.open(filename, "r", encoding="utf-8") as f:
        content = f.read()
    # Write it back as pure ASCII, replacing each non-ASCII char with a numeric entity
    with io.open(filename, "wb") as f:
        f.write(content.encode("ascii", "xmlcharrefreplace"))
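
To see what the "xmlcharrefreplace" error handler does without touching any file, here is a small in-memory sketch. The helper name to_ascii_entities is made up for this example. Note that the handler emits numeric character references (such as &#243;) rather than named entities (such as &oacute;); both forms are pure ASCII.

```python
def to_ascii_entities(html_text):
    """Return html_text with every non-ASCII char replaced by a numeric entity."""
    # encode() yields ASCII bytes; decode() turns them back into a text string.
    return html_text.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_ascii_entities(u"<p>canci\u00f3n</p>"))  # prints <p>canci&#243;n</p>
```

ASCII-only input passes through unchanged, so running the function on HTML that has already been fixed is harmless.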

Note: This convoluted approach should not be necessary, because pandoc accepts the switch --ascii, which forces it to produce HTML like that obtained in step 2. However, Docverter's parser for boolean options seems to be broken, so it is not possible to pass the ascii option to Docverter.

Answered 2013-10-11T15:43:57.883