python - How to get the content from a PDF file and store it in a txt file

Question

import pyPdf 
f= open('jayabal_appt.pdf','rb')
pdfl = pyPdf.PdfFileReader(f)
content=""
for i in range(0,1):
   content += pdfl.getPage(i).extractText() + "\n"
outpu = open('b.txt','wb')
outpu.write(content) 
f.close()
outpu.close()

This does not get the content from the PDF file and store it in a txt file... What is wrong with this code?

score 1 · Accepted Answer

A simple example from the author suggests doing this (you don't seem to be using 'file'):

from pyPdf import PdfFileWriter, PdfFileReader

output = PdfFileWriter()
input1 = PdfFileReader(file("jayabal_appt.pdf", "rb"))

Then you can do the following:

output.addPage(input1.getPage(0))

And sure, you can use a for loop over the pages, but the author's example doesn't use extractText.

Just check out the website, the example is rather straightforward: http://pybrary.net/pyPdf/

However

pyPdf is no longer maintained, so I don't recommend using it. The author suggests checking out PyPDF2 instead.
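For what it's worth, here is a rough sketch of how the question's goal might look with PyPDF2. It assumes Python 3 and PyPDF2 2.0 or later (where the reader class is PdfReader and pages expose extract_text()). The helper names join_page_text and pdf_to_txt are made up for this example, and the file names are the ones from the question.

```python
def join_page_text(pages):
    """Concatenate the text of page-like objects, one page after another."""
    # extract_text() can come back empty (or None) for image-only pages,
    # so fall back to an empty string instead of failing on the join.
    return "\n".join((page.extract_text() or "") for page in pages)


def pdf_to_txt(pdf_path, txt_path):
    """Extract the text of every page in pdf_path and write it to txt_path."""
    # Assumption: PyPDF2 >= 2.0 is installed. The import is done here so
    # that join_page_text() can be used without the package.
    from PyPDF2 import PdfReader

    with open(pdf_path, "rb") as pdf_file:
        reader = PdfReader(pdf_file)
        text = join_page_text(reader.pages)

    # Text mode with an explicit encoding: writing a str to a file opened
    # with "wb" raises a TypeError on Python 3.
    with open(txt_path, "w", encoding="utf-8") as out:
        out.write(text)
    return text


if __name__ == "__main__":
    pdf_to_txt("jayabal_appt.pdf", "b.txt")
```

If b.txt still ends up empty, the PDF may simply contain scanned images rather than real text, in which case no text extractor will help and you would need OCR.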

A simple Google search also suggests that you should try pdftotext or pdfminer. There are plenty of examples out there.
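As a starting point, the two alternatives could be tried roughly like this. This is only a sketch: it assumes Python 3, that the pdftotext command-line tool is installed and on your PATH for the first variant, and that the pdfminer.six package is installed for the second. The function names are invented for this example.

```python
import shutil
import subprocess


def build_pdftotext_cmd(pdf_path, txt_path, keep_layout=False):
    """Build the argument list for the pdftotext command-line tool."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")  # ask pdftotext to keep the page layout
    cmd += [pdf_path, txt_path]
    return cmd


def pdf_to_txt_cli(pdf_path, txt_path):
    """Convert a PDF to text by calling the external pdftotext binary."""
    # Assumption: pdftotext is installed separately and is on PATH.
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    subprocess.run(build_pdftotext_cmd(pdf_path, txt_path), check=True)


def pdf_to_txt_pdfminer(pdf_path, txt_path):
    """Convert a PDF to text with pdfminer.six instead of an external tool."""
    # Assumption: the pdfminer.six package is installed; imported lazily
    # so the pdftotext variant works without it.
    from pdfminer.high_level import extract_text

    with open(txt_path, "w", encoding="utf-8") as out:
        out.write(extract_text(pdf_path))
```

For the file in the question that would be something like pdf_to_txt_cli("jayabal_appt.pdf", "b.txt"). Which of the two gives cleaner output tends to depend on the PDF, so it may be worth trying both.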

Good luck.
