python - What is the best way to extract data from a pdf

Question

I have thousands of pdf files that I need to extract data from. This is an example pdf. I want to extract this information from the example pdf.

I am open to nodejs, python, or any other effective method. I have little knowledge of python and nodejs. I tried using python with this code:

import PyPDF2

try:
   pdfFileObj = open('test.pdf', 'rb')
   pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
   pageNumber = pdfReader.numPages
   page = pdfReader.getPage(0)
   print(pageNumber)

   pagecontent = page.extractText()
   print(pagecontent)
except Exception as e:
   print(e)

But I got stuck on how to find the procurement history. What is the best way to extract the procurement history from the pdf?

score 2 · Accepted Answer

pdfplumber is the best option. [Reference]

Installation

pip install pdfplumber

Extract all the text

import pdfplumber
path = 'path_to_pdf.pdf'
with pdfplumber.open(path) as pdf:
    for page in pdf.pages:
        print(page.extract_text())
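
Once each page's text is available as a string, finding the procurement history comes down to ordinary string handling. The following is only a minimal sketch: the function names, the heading text "Procurement History", and the sample rows are assumptions made for illustration, not something taken from the asker's pdf.

```python
from typing import Iterable, List, Optional


def pages_containing(page_texts: Iterable[Optional[str]], keyword: str) -> List[int]:
    """Return the 1-based numbers of the pages whose text mentions keyword."""
    needle = keyword.casefold()
    hits = []
    for number, text in enumerate(page_texts, start=1):
        # Treat a page with no extractable text (None) as empty.
        if needle in (text or "").casefold():
            hits.append(number)
    return hits


def lines_after(text: str, keyword: str) -> List[str]:
    """Return the non-empty lines following the first line that mentions keyword."""
    lines = text.splitlines()
    needle = keyword.casefold()
    for index, line in enumerate(lines):
        if needle in line.casefold():
            return [row.strip() for row in lines[index + 1:] if row.strip()]
    return []


# Hypothetical page texts standing in for what page.extract_text() returns.
sample_pages = [
    "Cover page\nNothing to see here",
    "Procurement History\n1 ABC12 N0001 10\n2 DEF34 N0002 25\n",
]

for page_number in pages_containing(sample_pages, "procurement history"):
    print(page_number, lines_after(sample_pages[page_number - 1], "procurement history"))
```

To use it with the loop above, collect `page.extract_text()` for every page into a list and pass that list in place of `sample_pages`.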

score 2 · Accepted Answer

I did something similar to scrape my grades a long time ago. The easiest (not pretty) solution I found was to convert the pdf to html, then parse the html.

To do so I used pdf2text/pdf2html (https://pypi.org/project/pdf-tools/) and the html module from lxml.
I also used codecs, but I don't remember exactly why.

A quick and dirty summary:

from lxml import html
import codecs
import os

# First convert the pdf to text/html
# You can skip this step if you already did it
os.system("pdf2txt -o file.html file.pdf")
# Open the file and read it (the with block closes it afterwards)
with codecs.open("file.html", "r", "utf-8") as file:
    data = file.read()
# We know we're dealing with html, let's load it
html_file = html.fromstring(data)
# As it's an html object, we can use xpath to get the data we need
# In the following I get the text from <div><span>MY TEXT</span></div>
extracted_data = html_file.xpath('//div//span/text()')
# It returns a list of strings, let's process it
for elm in extracted_data:
    # Do things with each string, for example print it
    print(elm)

Check the output of pdf2text or pdf2html first; once you know its structure, an xpath expression should extract your information easily.

I hope it helps!

EDIT: commented the code

EDIT2: The following code prints your data

# Assuming you're only giving the page 4 of your document
# os.system("pdf2html test-page4.pdf > test-page4.html")

from lxml import html
import codecs
import os

file = codecs.open("test-page4.html", "r", "utf-8")
data = file.read()
html_file = html.fromstring(data)
# I updated xpath to your need
extracted_data = html_file.xpath('//div//p//span/text()')
for elm in extracted_data:
    line_elements = elm.split()
    # Just observed that what you need starts with a number
    if len(line_elements) > 0 and line_elements[0].isdigit():
        print(line_elements)
file.close()
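
If lxml is not available, roughly the same filtering can be sketched with the standard library's html.parser. This is an illustration under assumptions, not the code I ran: the class name, the function name, and the sample HTML are made up, and unlike the xpath above it collects text from every <span>, not only spans nested inside <div> and <p>.

```python
from html.parser import HTMLParser


class SpanTextCollector(HTMLParser):
    """Collect the text found inside <span> elements."""

    def __init__(self):
        super().__init__()
        self._span_depth = 0  # how many <span> tags are currently open
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self._span_depth += 1

    def handle_endtag(self, tag):
        if tag == "span" and self._span_depth > 0:
            self._span_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that sits inside at least one <span>.
        if self._span_depth > 0 and data.strip():
            self.texts.append(data.strip())


def numbered_rows(html_text):
    """Return the span texts whose first field is a number, split into fields."""
    collector = SpanTextCollector()
    collector.feed(html_text)
    collector.close()
    rows = []
    for text in collector.texts:
        fields = text.split()
        if fields and fields[0].isdigit():
            rows.append(fields)
    return rows


# Hypothetical HTML standing in for the converted page.
sample_html = (
    "<div><p><span>Item Qty Price</span></p>"
    "<p><span>1 10 4.50</span></p>"
    "<p><span>2 3 12.00</span></p></div>"
)
print(numbered_rows(sample_html))
```

As in EDIT2, the header row is dropped because its first field is not a number.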

score 0 · Accepted Answer

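I work for the company that makes PDFTables. The PDFTables API would help you solve this problem and convert all your PDFs in one go. It is a simple web-based API, so it can be called from any programming language. You will need to create an account at PDFTables.com and then use a script in one of the example languages here: https://pdftables.com/pdf-to-excel-api. Here is an example using Python: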

import pdftables_api
import os

c = pdftables_api.Client('MY-API-KEY')

file_path = "C:\\Users\\MyName\\Documents\\PDFTablesCode\\"

for file in os.listdir(file_path):
    if file.endswith(".pdf"):
        c.xlsx(os.path.join(file_path,file), file+'.xlsx')

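The script looks for all files in the folder with the extension ".pdf", then converts each one to XLSX format. You can change the format to ".csv", ".html" or ".xml". The first 75 pages are free.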
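
Note that the script above names each output by appending to the full file name, so "report.pdf" becomes "report.pdf.xlsx". If you would rather replace the extension, the pairing of input and output names can be sketched as below. The helper name `conversion_pairs` and the sample file names are hypothetical and not part of the PDFTables client.

```python
from pathlib import PurePath


def conversion_pairs(file_names, out_suffix=".xlsx"):
    """Return (input_name, output_name) pairs for every PDF in file_names."""
    pairs = []
    for name in sorted(file_names):
        path = PurePath(name)
        # Match the extension case-insensitively so "A.PDF" is included too.
        if path.suffix.lower() == ".pdf":
            pairs.append((name, str(path.with_suffix(out_suffix))))
    return pairs


# Hypothetical directory listing, as os.listdir(file_path) might return it.
print(conversion_pairs(["b.pdf", "notes.txt", "A.PDF"]))
```

Each pair can then be joined with file_path and passed to `c.xlsx(...)` in the same way as in the loop above.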

score 0 · Accepted Answer

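OK. I help develop this commercial product from opait.com. I took your input PDF and zoned a few areas in it, like this:

And your table:

In about 2 minutes I can extract it from this and 1000 similar documents. Note that this image is the log view, and that the data is exported as CSV. All the blue "links" are the actual data extracted, and they link back to the PDF so you can see where each value came from. The output can also be XML, JSON, or other formats. What you see in that screenshot is the log view, and all of it is available as CSV (one file for the main properties and one for each table, linked by record ID, in case you have a single PDF that contains 1000 of these documents).

Again, I help develop this product, but what you are asking for can be done. I extracted your entire table along with all the other important fields.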

score 0 · Accepted Answer

Here is a four-line script in IntelliGet

{ start = IsSubstring("CAGE   Contract Number",Line(-2));  
  end = IsEqual(0, Length(Line(1)));
  { start = 1;
    output = Line(0);
  }
}
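
For readers who do not use IntelliGet, here is a rough Python sketch of one possible reading of that script: start on the line that comes two lines after the "CAGE   Contract Number" header, output each line, and stop when the next line is empty. The function name and the sample lines are assumptions for illustration, and the exact IntelliGet semantics may differ.

```python
def extract_block(lines, header="CAGE   Contract Number"):
    """Return the lines from two lines below the header up to the next empty line."""
    output = []
    inside = False
    for i, line in enumerate(lines):
        # Start when the line two positions earlier contains the header.
        if not inside and i >= 2 and header in lines[i - 2]:
            inside = True
        if inside:
            output.append(line)
            # Stop when the following line is empty (or there is no following line).
            next_line = lines[i + 1] if i + 1 < len(lines) else ""
            if len(next_line) == 0:
                break
    return output


# Hypothetical text lines standing in for the pdf content.
sample_lines = [
    "Procurement History",
    "CAGE   Contract Number   Quantity",
    "---------------------------------",
    "1ABC2  N0001234567       10",
    "3DEF4  N0007654321       25",
    "",
    "Footer text",
]
for row in extract_block(sample_lines):
    print(row)
```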

score 0 · Accepted Answer

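The company I work for, PDFTron, has a fully automated PDF to HTML output solution.

You can try it online here: https://www.pdftron.com/pdf-tools/pdf-table-extraction

Here is a screenshot of the HTML output for the file you provided. The output contains two HTML tables, with reflowable text content in between.

The output is standard XML HTML, so you can easily parse/manipulate the HTML tables.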
