I found this script on Stack Overflow and modified it (slightly) so that it runs on Python 3.3:

from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from io import StringIO

def convert_pdf(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    fp = open(path, 'rb')
    process_pdf(rsrcmgr, device, fp)
    fp.close()
    device.close()
    string = retstr.getvalue()
    retstr.close()
    return string

print(convert_pdf('abc.pdf'))

It works fine, but I seem to have two problems:

When I run the script, I get a lot of warnings:
WARNING:root:undefined: PDFCIDFont: basefont='LKOELN+Wingdings-Regular', cidcoding='Adobe-Identity', 139
WARNING:root:undefined: PDFCIDFont: basefont='LKKPCF+Wingdings2', cidcoding='Adobe-Identity', 132
These show up in the printed text as (cid:139). How can I catch this warning and replace that text with something else?
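
One possible approach, sketched with the standard library only: the `WARNING:root:` prefix suggests the messages are logged on the root logger, so a `logging.Filter` attached to it could record and suppress them, and the `(cid:N)` markers could be rewritten afterwards with a regular expression. The names `CidWarningCollector` and `replace_cid_markers` and the `'?'` placeholder are hypothetical, and matching on the text `PDFCIDFont` is an assumption based on the messages above, not something checked against pdfminer itself.

```python
import logging
import re

# Matches markers such as "(cid:139)" in the extracted text.
CID_MARKER = re.compile(r"\(cid:(\d+)\)")


class CidWarningCollector(logging.Filter):
    """Record and suppress warnings whose message mentions PDFCIDFont."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def filter(self, record):
        message = record.getMessage()
        if record.levelno == logging.WARNING and "PDFCIDFont" in message:
            self.messages.append(message)
            return False  # drop the record so it is not printed
        return True  # let every other record through


def replace_cid_markers(text, replacement="?"):
    """Replace every '(cid:N)' marker in text with a placeholder."""
    return CID_MARKER.sub(replacement, text)


collector = CidWarningCollector()
logging.getLogger().addFilter(collector)  # root logger, as in 'WARNING:root:'

# Stand-ins for the convert_pdf('abc.pdf') call, so the sketch runs without a PDF.
logging.warning("undefined: PDFCIDFont: basefont='LKKPCF+Wingdings2'")
raw_text = "Item (cid:139) done (cid:132)"

clean_text = replace_cid_markers(raw_text)
```

In the real script, the filter would be installed before calling `convert_pdf`, and `replace_cid_markers` would be applied to its return value. `collector.messages` then holds the suppressed warnings for later inspection.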

Note that I have a codec line. In the original script, codec was passed in the
TextConverter(rsrcmgr, retstr, laparams=laparams)
call, but when I pass it I get this error:

Traceback (most recent call last):
  File "C:/Users/rodrigo/Desktop/csp_pdf/csp_pdf2.py", line 46, in <module>
    convert_pdf('abc.pdf')
  File "C:/Users/rodrigo/Desktop/csp_pdf/csp_pdf2.py", line 33, in convert_pdf
    device = TextConverter(rsrcmgr, retstr, codec = 'utf-8', laparams=laparams)
TypeError: __init__() got an unexpected keyword argument 'codec'

Is this related to the first problem?
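
For what it is worth, the TypeError only says that the installed `TextConverter` constructor does not declare a `codec` keyword, which looks like a separate matter from the font warnings. A possible workaround, sketched below, is to pass only the keyword arguments a constructor actually declares. `accepted_kwargs` and `FakeConverter` are hypothetical names; `FakeConverter` merely stands in for `TextConverter` so the sketch runs without pdfminer installed.

```python
import inspect


def accepted_kwargs(func, **kwargs):
    """Return only the kwargs that func's signature declares by name."""
    params = inspect.signature(func).parameters
    # If func takes **kwargs itself, everything is accepted.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs)
    return {name: value for name, value in kwargs.items() if name in params}


class FakeConverter:
    """Hypothetical stand-in for a converter without a codec parameter."""

    def __init__(self, rsrcmgr, outfp, laparams=None):
        self.laparams = laparams


# 'codec' is dropped because FakeConverter.__init__ does not declare it.
options = accepted_kwargs(FakeConverter, codec="utf-8", laparams="layout")
device = FakeConverter(None, None, **options)
```

With the real class, the call would look like `TextConverter(rsrcmgr, retstr, **accepted_kwargs(TextConverter, codec=codec, laparams=laparams))`, so the same script runs whether or not the installed version accepts `codec`.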
Thanks!