python - How to parse text extracted from a PDF file with delimiters using Python?

Asked 2017-09-24T10:51:04.597

Viewed 3265 times

I tried using PyPDF2 to extract and parse text from a PDF with the following code snippet:

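import re

import PyPDF2

# Open the PDF in binary mode; the file must stay open while pages are read.
with open('test.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)  # legacy PyPDF2 (< 3.0) API
    # getPage() requires a zero-based page index; 0 (the first page) is assumed here.
    rawText = pdfReader.getPage(0).extractText()

# Split the raw text on newlines and tabs.
extractedText = re.split(r'\n|\t', rawText)
print("Extracted Text: " + str(extractedText) + "\n")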

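Case 1: When I parse the PDF text, I cannot reproduce it exactly as it appears in the PDF. For example,

in this case no line break or newline character can be found in either rawText or extractedText, and the result looks like this: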

    input field, your old automation script will try to submit a form with missing data unless you update it.Another common case is asserting that a specific error message appeared and then updating the error message, which will also break the script.
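One possible workaround for Case 1 is a heuristic, not a general fix. It assumes that a lost line break always shows up as sentence-ending punctuation fused directly to a capital letter, as in `it.Another` above. The function name `restore_breaks` is hypothetical, and the pattern will misfire on text such as `e.G` inside identifiers or abbreviations.

```python
import re

def restore_breaks(text, sep="\n"):
    """Insert `sep` where sentence punctuation is fused to a capital letter."""
    # Lookbehind: a lowercase letter followed by '.', '!' or '?'.
    # Lookahead: an uppercase letter with no space in between.
    return re.sub(r"(?<=[a-z][.!?])(?=[A-Z])", sep, text)

sample = "unless you update it.Another common case is asserting"
print(restore_breaks(sample))
```

Passing `sep=" "` instead would keep the text on one line and only restore the missing space.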

Case 2: For the following case,

it gives this result:

2B. Community Living5710509-112C. Lifelong Learning69116310-122D. Employment5710509-11

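This is even harder to parse, and the individual scores are hard to tell apart. Is it possible to parse these scenarios perfectly with PyPDF2 or any other Python library?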
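As a partial sketch for Case 2, the fused string can at least be cut into one record per row, assuming every row starts with a label of the form digit, capital letter, period, space (`2B. `, `2C. `, ...). The names `LABEL` and `split_records` are hypothetical. The sketch deliberately leaves each digit run unsplit, because a run such as `5710509-11` does not say where one score ends and the next begins.

```python
import re

# Output from Case 2, copied verbatim from the question.
row_text = ("2B. Community Living5710509-112C. Lifelong Learning"
            "69116310-122D. Employment5710509-11")

# Assumption: each record begins with a label such as "2B. ".
LABEL = r"\d[A-Z]\. "

def split_records(text):
    """Return (label, digit_run) pairs; the digit run is left unsplit."""
    # Lazily match from one label up to the next label or the end of the text.
    records = re.findall(rf"{LABEL}.*?(?={LABEL}|$)", text)
    pairs = []
    for record in records:
        # The label runs up to the first digit after the "2B. " prefix.
        match = re.match(rf"({LABEL}\D+)(.*)", record)
        if match:
            pairs.append((match.group(1), match.group(2)))
    return pairs

for label, digits in split_records(row_text):
    print(label, "->", digits)
```

Separating the scores inside each digit run probably needs positional information that the flat string no longer carries. A layout-aware extractor that reports word coordinates or table cells, such as pdfplumber's `extract_words()` or `extract_table()`, may be worth trying for that step.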

