python - Python pdfminer pdf2html: apostrophes converted to special characters

Question

I am using the pdfminer package in Python to convert a PDF to HTML, but it converts apostrophes into special characters. Example:

â€˜This is a text between apostrophesâ€™

It should be:

'This is a text between apostrophes'

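Is there a way to convert the special characters back to apostrophes, or to change the encoding? I am not very familiar with character encodings. Maybe I can choose an encoding when converting to HTML?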

score 0 · Accepted Answer

I assume the quotes are the Unicode characters LEFT SINGLE QUOTATION MARK (U+2018) and RIGHT SINGLE QUOTATION MARK (U+2019). Encoded as UTF-8, they are:

'\xe2\x80\x98This is a text between apostrophes\xe2\x80\x99'
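This can be checked by encoding the two characters directly. The sketch below uses Python 3 (the snippets in this answer use Python 2 string literals), and the variable names are only for illustration.

```python
# Python 3: encode the two curly quotes and inspect the raw UTF-8 bytes.
left_quote = "\u2018"   # LEFT SINGLE QUOTATION MARK
right_quote = "\u2019"  # RIGHT SINGLE QUOTATION MARK

left_bytes = left_quote.encode("utf-8")
right_bytes = right_quote.encode("utf-8")

print(left_bytes)   # b'\xe2\x80\x98'
print(right_bytes)  # b'\xe2\x80\x99'
```

Each quote takes exactly three bytes in UTF-8.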

The bytes in your text are:

'\xc3\xa2\xe2\x82\xac\xcb\x9cThis is a text between apostrophes\xc3\xa2\xe2\x82\xac\xe2\x84\xa2'

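That is 7 bytes for the opening quote and 8 for the closing one, instead of 3 each, which makes me wonder whether the string was encoded more than once. I tried a few combinations, for example: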

>>> u'\u2018'.encode('utf-8').decode('iso-8859-1').encode('utf-8')
'\xc3\xa2\xc2\x80\xc2\x98'

Unfortunately, I could not reproduce the result you got.
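For what it's worth, one combination that does seem to match is decoding the UTF-8 bytes as Windows-1252 (`cp1252`) rather than ISO-8859-1. The Python 3 sketch below reproduces the byte sequence from the question under that assumption and then reverses the step. The names `original`, `garbled`, `repaired`, and `plain` are illustrative, and the repair only works if the text really went through this exact round trip.

```python
# Python 3 sketch. Assumption: UTF-8 text was mistakenly decoded as Windows-1252.
original = "\u2018This is a text between apostrophes\u2019"

# Simulate the suspected mistake: read the UTF-8 bytes as cp1252.
garbled = original.encode("utf-8").decode("cp1252")

# Re-encoding the garbled text as UTF-8 gives the long byte sequence
# quoted earlier in this answer.
garbled_bytes = garbled.encode("utf-8")
print(garbled_bytes)

# Reverse the mistake to recover the curly quotes.
repaired = garbled.encode("cp1252").decode("utf-8")

# Optionally replace the curly quotes with plain ASCII apostrophes.
plain = repaired.replace("\u2018", "'").replace("\u2019", "'")
print(plain)  # 'This is a text between apostrophes'
```

If the characters only look wrong in a browser, the HTML file itself may already be valid UTF-8 and the page is simply being displayed with the wrong charset. In that case, declaring UTF-8 in the HTML is probably a better fix than rewriting the text.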
