python - How to scrape PDFs using Python; specific content only

Question

I am trying to get data from the PDFs available on this website:

https://usda.library.cornell.edu/concern/publications/3t945q76s?locale=en

For example, take the November 2019 report:

https://downloads.usda.library.cornell.edu/usda-esmis/files/3t945q76s/dz011445t/mg74r196p/latest.pdf

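I need the corn data on page 12, and I have to create separate files for ending stocks, exports, and so on. I am new to Python and do not know how to scrape each piece of content individually. If I can work it out for one month, I can write a loop for the rest, but I am confused about how to handle even a single file.

Can someone help me out here? TIA.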

score 6 · Accepted Answer

Here is a small example using PyPDF2, requests, and BeautifulSoup. Please check the comments in the code. This is the first block; if you need more, change the value in the url variable.

# Install the dependencies first:
# pip install PyPDF2              -> read and parse the PDF content
# pip install requests            -> download the page and the PDFs
# pip install beautifulsoup4 lxml -> parse the HTML and find every href ending in ".pdf"
import io

import requests
from bs4 import BeautifulSoup
from PyPDF2 import PdfReader  # PdfFileReader was removed in PyPDF2 3.0

url = requests.get('https://usda.library.cornell.edu/concern/publications/3t945q76s?locale=en#release-items')
soup = BeautifulSoup(url.content, "lxml")

for a in soup.find_all('a', href=True):
    urlpdf = a['href']
    if urlpdf.endswith('.pdf'):
        print("url ending in .pdf:", urlpdf)
        response = requests.get(urlpdf)
        with io.BytesIO(response.content) as f:
            pdf = PdfReader(f)
            information = pdf.metadata
            number_of_pages = len(pdf.pages)
            txt = f"""
            Author: {information.author}
            Creator: {information.creator}
            Producer: {information.producer}
            Subject: {information.subject}
            Title: {information.title}
            Number of pages: {number_of_pages}
            """
            # Metadata of the PDF
            print(txt)
            # Zero-based page index: 20 selects the 21st page (use 11 for page 12)
            numpage = 20
            if numpage < number_of_pages:
                page = pdf.pages[numpage]
                page_content = page.extract_text()
                # Print the text extracted from that page
                print(page_content)
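
The question also asks for separate files for ending stocks, exports, and so on. One possible next step, sketched below, is to search the extracted page text for lines that start with a given row label and collect the numbers on that line. This is only a sketch under assumptions: the helper names (rows_for_labels, filename_for, write_series) are hypothetical, the sample text is made up for illustration and is not taken from the report, and text extracted from a real PDF table may not come out one row per line, so the matching will probably need adjusting.

```python
import csv
import os
import re

# A number such as 1,850 or 13.5 (thousands separators allowed).
NUMBER = re.compile(r"-?\d[\d,]*\.?\d*")


def rows_for_labels(page_text, labels):
    """Map each label to the numbers on the first line that starts with it."""
    found = {}
    for line in page_text.splitlines():
        stripped = line.strip()
        for label in labels:
            if label in found:
                continue
            if stripped.lower().startswith(label.lower()):
                rest = stripped[len(label):]
                found[label] = [
                    float(n.replace(",", "")) for n in NUMBER.findall(rest)
                ]
                break
    return found


def filename_for(label):
    """Turn a row label such as 'Ending Stocks' into 'ending_stocks.csv'."""
    return re.sub(r"\W+", "_", label.strip().lower()) + ".csv"


def write_series(found, directory="."):
    """Write one CSV file per label, each holding a single row of values."""
    for label, values in found.items():
        path = os.path.join(directory, filename_for(label))
        with open(path, "w", newline="") as handle:
            csv.writer(handle).writerow([label, *values])


# Made-up sample text standing in for page_content (NOT real report data).
sample_text = "CORN\nExports 2,000 1,900 1,850\nEnding Stocks 2,100 1,950 1,910\n"
print(rows_for_labels(sample_text, ["Exports", "Ending Stocks"]))
```

Passing page_content from the loop above to rows_for_labels, and the result to write_series, would give one file per label. Footnote markers such as "2/" next to a label would be picked up as numbers, so real rows may need extra cleaning.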

score 1 · Accepted Answer

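If you need to scrape data from a website, I would recommend Beautiful Soup, but it looks like you will need OCR to extract data from the PDF. There is a library called pytesseract. Take a look at it and its tutorials, and you should be set.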

score 0 · Accepted Answer

Try pdfreader. You can extract the tables as PDF markdown containing decoded text strings, and then parse that as plain text.


from pdfreader import SimplePDFViewer
fd = open("latest.pdf","rb")
viewer = SimplePDFViewer(fd)
viewer.navigate(12)
viewer.render()
markdown = viewer.canvas.text_content

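The markdown variable contains all the text, including the PDF commands (positioning, showing): every string is enclosed in parentheses and followed by a Tj or TJ operator. For more on PDF text operators, see PDF 1.7, sec. 9.4 "Text Objects".

You can parse it with regular expressions, for example.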
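
As a rough illustration of that regex approach, the sketch below pulls the strings shown by Tj and TJ operators out of such markdown. It is a minimal sketch, not a full PDF string parser: the names extract_strings and unescape are hypothetical, the sample stream is made up, and it assumes literal strings with no nested unescaped parentheses and no "]" inside a TJ array.

```python
import re

# One PDF literal string "(...)", allowing backslash escapes such as \( and \).
# Nested unescaped parentheses are not handled.
LITERAL = r"\((?:\\.|[^\\()])*\)"

# A text-showing operation is either "(string) Tj" or "[(a) -250 (b)] TJ".
SHOW_TEXT = re.compile(
    rf"(?P<single>{LITERAL})\s*Tj|\[(?P<array>[^\]]*)\]\s*TJ"
)


def unescape(literal):
    """Strip the outer parentheses and undo the escapes for ( ) and backslash."""
    return re.sub(r"\\([()\\])", r"\1", literal[1:-1])


def extract_strings(markdown):
    """Return the text shown by each Tj/TJ operator, in stream order."""
    results = []
    for match in SHOW_TEXT.finditer(markdown):
        if match.group("single") is not None:
            results.append(unescape(match.group("single")))
        else:
            # A TJ array mixes strings with kerning numbers; keep the strings.
            parts = re.findall(LITERAL, match.group("array"))
            results.append("".join(unescape(p) for p in parts))
    return results


# Made-up content stream for illustration (not taken from the report).
sample = "BT /F1 9 Tf 72 700 Td (Exports) Tj 0 -12 Td [(Ending) -250 ( Stocks)] TJ ET"
print(extract_strings(sample))
```

With the viewer code above, extract_strings(markdown) would return the page's strings as a flat list; grouping them back into table rows still depends on the positioning operators or on the order in which the cells appear.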
