python - extractText() function in PyPDF2 throws an error

Question

I am trying to extract text from a PDF so that I can analyse it, but when I try to extract the text from a page I get the following error.

Traceback (most recent call last):
File "C:\Program Files (x86)\eclipse\plugins\org.python.pydev_2.7.4.2013051601\pysrc\pydevd_comm.py", line 765, in doIt
    result = pydevd_vars.evaluateExpression(self.thread_id, self.frame_id, self.expression, self.doExec)

File "C:\Program Files (x86)\eclipse\plugins\org.python.pydev_2.7.4.2013051601\pysrc\pydevd_vars.py", line 376, in evaluateExpression
    result = eval(compiled, updated_globals, frame.f_locals)

File "<string>", line 1, in <module>

File "C:\Python33\lib\site-packages\pypdf2-1.9.0-py3.3.egg\PyPDF2\pdf.py", line 1701, in extractText
    content = ContentStream(content, self.pdf)

File "C:\Python33\lib\site-packages\pypdf2-1.9.0-py3.3.egg\PyPDF2\pdf.py", line 1783, in __init__
    stream = StringIO(stream.getData())

File "C:\Python33\lib\site-packages\pypdf2-1.9.0-py3.3.egg\PyPDF2\generic.py", line 801, in getData
    decoded._data = filters.decodeStreamData(self)

File "C:\Python33\lib\site-packages\pypdf2-1.9.0-py3.3.egg\PyPDF2\filters.py", line 228, in decodeStreamData
    data = ASCII85Decode.decode(data)

File "C:\Python33\lib\site-packages\pypdf2-1.9.0-py3.3.egg\PyPDF2\filters.py", line 170, in decode
    data = [y for y in data if not (y in ' \n\r\t')]

File "C:\Python33\lib\site-packages\pypdf2-1.9.0-py3.3.egg\PyPDF2\filters.py", line 170, in <listcomp>
    data = [y for y in data if not (y in ' \n\r\t')]

TypeError: 'in <string>' requires string as left operand, not int

The relevant section of code is as follows:

from PyPDF2 import PdfFileReader

for PDF_Entry in self.PDF_List:
    Pdf_File = PdfFileReader(open(PDF_Entry, "rb"))
    for pg_idx in range(0, Pdf_File.getNumPages()):
        page_Content = Pdf_File.getPage(pg_idx).extractText()
        for line in page_Content.split("\n"):
            self.Analyse_Line(line)

The error is thrown on the extractText() line.

score 2 · Accepted Answer

It might be worth trying the latest version of PyPDF2, which is 1.24 at the time of writing.
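For context, the last frames of the traceback in the question suggest a Python 3 incompatibility in that release rather than a problem with the PDF itself: iterating over a bytes object yields ints on Python 3, so testing each item against the str ' \n\r\t' raises this kind of TypeError. The snippet below is only a standalone illustration of that behaviour. The sample data is made up and the bytes comparison is not PyPDF2's actual fix; whether a particular newer release handles this line is something to confirm by upgrading.

```python
# Made-up sample standing in for the stream data the ASCII85 filter receives.
data = b"87cU RD]j\n"

# Same expression as filters.py line 170 in the traceback: on Python 3 each
# item of a bytes object is an int, so `int in str` raises TypeError.
try:
    [y for y in data if not (y in ' \n\r\t')]
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Comparing against a bytes literal works, because `int in bytes` is allowed.
cleaned = bytes(y for y in data if y not in b' \n\r\t')

print(error_message)
print(cleaned)
```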

That said, I've found the extractText() functionality to be very fragile. It works for some files but not for others. See some of the open issues:

https://github.com/mstamy2/PyPDF2/issues/180 and https://github.com/mstamy2/PyPDF2/issues/168

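I worked around this by switching to the Poppler command-line utility pdftotext, both to classify documents as image versus text and to get all of the content. It has been very stable for me: I've run it on many thousands of PDF documents. In my experience it also extracts text from protected/encrypted PDFs without complaint.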

For example (written for Python 2):

from __future__ import print_function  # print(..., file=...) on Python 2

import subprocess
import sys

def consult_pdftotext(filename):
    '''
    Runs pdftotext to extract text of pages 1..3.
    Returns the count of characters received.

    `filename`: Name of PDF file to be analyzed.
    '''
    print("Running pdftotext on file %s" % filename, file=sys.stderr)
    # The final "-" tells pdftotext to write to stdout instead of a file.
    cmd_args = ["pdftotext", "-f", "1", "-l", "3", filename, "-"]
    pdf_pipe = subprocess.Popen(cmd_args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    std_out, std_err = pdf_pipe.communicate()
    count = len(std_out)
    return count
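
On Python 3 the same idea can be sketched with subprocess.run. This is a hedged variant, not the original code from this answer: the helper names build_pdftotext_cmd and pdftotext_text are made up for illustration, it assumes the pdftotext binary from Poppler is on PATH, and decoding the output as UTF-8 is an assumption to adjust if your setup differs. Only the -f, -l and trailing "-" arguments are taken from the example above.

```python
import shutil
import subprocess


def build_pdftotext_cmd(filename, first_page=1, last_page=3):
    """Build the pdftotext argument list; the trailing "-" means write to stdout."""
    return ["pdftotext", "-f", str(first_page), "-l", str(last_page), filename, "-"]


def pdftotext_text(filename, first_page=1, last_page=3):
    """Return the extracted text of the page range, or None if pdftotext fails."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext (Poppler) was not found on PATH")
    result = subprocess.run(
        build_pdftotext_cmd(filename, first_page, last_page),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    if result.returncode != 0:
        return None
    # Assumption: output is UTF-8; undecodable bytes are replaced, not fatal.
    return result.stdout.decode("utf-8", errors="replace")
```

A document whose first pages yield little or no text is probably image-only, so something like len((pdftotext_text(path) or "").strip()) can serve as a rough text-versus-image check, where path is whatever file you are testing.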

Hope this helps.

score 1 · Accepted Answer

You are doing two things on one line. Try breaking the statement apart to get closer to the problem. Change:

page_Content = Pdf_File.getPage(pg_idx).extractText()

into

page = Pdf_File.getPage(pg_idx)
page_Content = page.extractText()

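to see where the error occurs. Also, run the program from the command line rather than from Eclipse, to make sure it is the same error. You say it happens in extractText(), but that line does not appear in the traceback.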
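
The same isolation can be applied to every page at once. A minimal sketch, assuming the PdfFileReader API used in the question (getNumPages, getPage, extractText); the helper name find_failing_pages is made up, and it accepts any reader-like object so it can be tried without a real PDF:

```python
def find_failing_pages(reader):
    """Return (page_index, step, exception) for each page that fails.

    `reader` is expected to look like PyPDF2's PdfFileReader from the question.
    """
    failures = []
    for pg_idx in range(reader.getNumPages()):
        try:
            page = reader.getPage(pg_idx)
        except Exception as exc:  # failure while fetching the page object
            failures.append((pg_idx, "getPage", exc))
            continue
        try:
            page.extractText()
        except Exception as exc:  # failure while extracting the page text
            failures.append((pg_idx, "extractText", exc))
    return failures
```

Called as find_failing_pages(Pdf_File), an empty list means every page extracted cleanly; otherwise the step name shows which half of the original line raised.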
