python - "'NoneType' object is not iterable" when trying to extract from a PDF

Question

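I am trying to extract data from a PDF, but I keep getting a TypeError because my object is not iterable (at the statement for line in text:). I don't understand why text has no value: right above it I create the text object with text = page.extract_text(), and then I want to iterate over each line of the text to look for matches to my regular expressions.

I'm afraid my statement for line in text: is the problem; maybe using 'line' is inappropriate, but I don't know what else to do.

My code is below. Thanks for looking!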

import requests
import pdfplumber
import pandas as pd
import re
from collections import namedtuple

Line = namedtuple('Line', 'gbloc_name contact_type email')

gbloc_re = re.compile(r'^(?:a\.\s[A-Z]{5}\:\s[A-Z]{4})')

line_re = re.compile(r'^[^@\s]+@[^@\s]\.[^@\s]+$')

file = 'sampleReport.pdf'
  
lines=[]

with pdfplumber.open(file) as pdf:
    pages = pdf.pages 
    for page in pdf.pages: 
        text = page.extract_text() 
        for line in text: 
            gbloc = gbloc_re.search(line) 
            if gbloc:
                gbloc_name = gbloc

            elif line.startswith('Outbound'):
                contact_type = 'Outbound'
            
            elif line.startswith('Tracing'):
                contact_type = 'Tracing'
            
            elif line.startswith('Customer'):
                contact_type = 'Customer Service'

            elif line.startswith('QA'):
                contact_type = 'Quality Assurance'
            
            elif line.startswith('NTS'):
                contact_type = 'NTS'

            elif line.startswith('Inbound'):
                contact_type = 'Inbound'
            
            elif line_re.search(line):
                items = line.split()
                lines.append(Line(gbloc_name, contact_type, *items))

score 0 · Accepted Answer

I use the PyPDF2 library to extract text from PDFs. Here is a simple example that extracts the content page by page.

from PyPDF2 import PdfReader  # PdfFileReader, numPages, getPage and extractText are the pre-3.0 names

with open('example.pdf', 'rb') as pdf_file:
    reader = PdfReader(pdf_file)
    print(len(reader.pages))
    # Extract and print the text of each page in order
    for i, page in enumerate(reader.pages):
        print("Page: ", i)
        print(page.extract_text())

Please check it and let me know if you have any questions.

score 0 · Accepted Answer

Try looping directly over the value of page.extract_text(), like this:

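with pdfplumber.open(file) as pdf:
    for page in pdf.pages:
        # extract_text() may return None for a page with no extractable text
        for line in (page.extract_text() or '').splitlines():
            print(line)  # process each line here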
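
For what it's worth, here is a minimal sketch of the two things that seem to be going wrong in the question. It assumes that page.extract_text() can return None for a page without a text layer (for example a scanned, image-only page), which is what the TypeError suggests. It also shows that iterating over a string yields single characters, not lines, so the startswith checks would never match even on pages that do have text. The helper name page_lines is hypothetical and not part of pdfplumber.

```python
def page_lines(text):
    """Return the lines of one page's text, or an empty list if there is none."""
    if text is None:  # assumption: the extractor returned None for this page
        return []
    return text.splitlines()


sample = "Outbound\nTracing"

# Looping over the string itself walks it character by character.
chars = [c for c in sample][:3]

# Splitting first gives whole lines that startswith() and the regexes can use.
lines = page_lines(sample)

# A page with no text simply contributes no lines instead of raising TypeError.
empty = page_lines(None)
```

In the question's loop this would be used as for line in page_lines(page.extract_text()):, leaving the rest of the body unchanged.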
