I am using pdfplumber to extract the content of some PDF documents.
import pandas as pd
import pdfplumber

# Files information
inputpath = r'myPath'
fileslist = inputpath + r'filesList'
OutputCSV = inputpath + r'output' + r'.csv'

df = pd.read_excel(fileslist)
pdfDoc = []
pdfText = []
fullpath = df[df['file_type'] == '.pdf']['full_path'].tolist()

for i in range(len(fullpath)):
    print(i)
    all_text = ''  # reset the accumulated text for each file
    with pdfplumber.open(fullpath[i]) as pdf:
        for page in pdf.pages:
            text = page.extract_text()
            if text is None:  # page has no extractable text
                continue
            all_text += text
    pdfText.append(all_text)
    pdfDoc.append(fullpath[i])  # record this file's path, not the whole list

df1 = pd.DataFrame({'full_path': pdfDoc, 'content': pdfText})
df1.to_csv(OutputCSV)  # write the extracted text, not the input file list
At i=30 I get the following error:
File "C:\ProgramData\Anaconda3\lib\site-packages\pdfminer\cmapdb.py", line 117, in decode
    return struct.unpack('>%dH' % n, code)
error: unpack requires a buffer of 230 bytes
Is there any way to solve this problem?
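The format string '>%dH' % n asks struct.unpack for n big-endian 2-byte values, so "a buffer of 230 bytes" means n was 115 and the data handed to it had a different length. The traceback points into pdfminer's character-map decoding, which suggests (I can't confirm it from the post alone) that the file at index 30 contains font data pdfminer cannot parse, rather than a problem in the loop itself. One possible workaround, a minimal sketch rather than a fix for the underlying file, is to catch the exception per file so a single bad PDF is recorded and skipped instead of stopping the whole run. The names extract_texts, open_pdf, rows and failures are hypothetical and only there for illustration; open_pdf defaults to pdfplumber.open.

```python
def extract_texts(paths, open_pdf=None):
    """Extract text per PDF; return (rows, failures) instead of raising.

    `open_pdf` is a hypothetical hook that defaults to pdfplumber.open.
    """
    if open_pdf is None:
        import pdfplumber  # imported lazily, only when the default is needed
        open_pdf = pdfplumber.open

    rows, failures = [], []
    for path in paths:
        try:
            with open_pdf(path) as pdf:
                # extract_text() can return None for pages without text
                pages = [page.extract_text() or "" for page in pdf.pages]
        except Exception as exc:  # e.g. struct.error raised inside pdfminer
            failures.append((path, repr(exc)))
            continue
        rows.append({"full_path": path, "content": "\n".join(pages)})
    return rows, failures
```

With the variables from the question, this could be used as rows, failures = extract_texts(fullpath), followed by pd.DataFrame(rows).to_csv(OutputCSV). The failures list then names the files that need a closer look, for example by opening them in another PDF tool or re-saving them.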