I'm using Python 3.4 on Windows 7 and trying to extract text from PDF files with PDFMiner. In my tests, however, text was frequently lost. For some files only a few sentences go missing, but for others, depending on the file, as much as half of the text could not be extracted. Here's my full code:
import io
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams

def convert_pdf(pdfFile, retstr):
    password = ''
    pagenos = set()
    maxpages = 0
    laparams = LAParams()
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    process_pdf(rsrcmgr, device, pdfFile, pagenos, maxpages=maxpages, password=password, check_extractable=True)
    device.close()
    return retstr

def extract_pdf(file_name, language):
    pdfFile = open(file_name, 'rb')
    retstr = io.StringIO()
    retstr = convert_pdf(pdfFile, retstr)
    whole = retstr.getvalue()
    original_texts = whole.split('\n')
    pdfFile.close()
    return original_texts
Is there a way to extract the full text using PDFMiner? I've heard of poppler, but I can't find out how to use it as a Python library, and I'd rather not call a command-line tool. Can anyone help?
Here's an example: a thesis. Several paragraphs were lost when I extracted it with the code above. For instance, on the second page I could only extract the first half, up to "Pereira, Tishby, and Lee (1993)" in the middle of the page. The output then skips straight to the next page for no apparent reason.
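To narrow down where the text goes missing, it may help to look at the extracted output page by page. The following is only a rough sketch: it assumes the text produced by TextConverter separates pages with a form feed character ('\f'), which I believe is the case but have not confirmed for every PDFMiner version. The helper name `summarize_pages` is made up for illustration; it takes the string from `retstr.getvalue()` and uses only the standard library.

```python
def summarize_pages(whole, page_sep='\f'):
    """Return (page_number, char_count, last_line) for each page of extracted text.

    Assumption: pages in `whole` are separated by `page_sep` (form feed by default).
    """
    pages = whole.split(page_sep)
    # A trailing separator leaves an empty final chunk; drop it.
    if pages and pages[-1].strip() == '':
        pages = pages[:-1]
    summary = []
    for number, page in enumerate(pages, start=1):
        # Ignore blank lines when looking for the last extracted line.
        lines = [line for line in page.split('\n') if line.strip()]
        last_line = lines[-1] if lines else ''
        summary.append((number, len(page.strip()), last_line))
    return summary


if __name__ == '__main__':
    # Made-up sample text standing in for retstr.getvalue().
    sample = 'Title\nIntro text\n\fPereira, Tishby, and Lee (1993)\n\fNext page\n\f'
    for number, count, last_line in summarize_pages(sample):
        print(number, count, repr(last_line))
```

A page with an unusually low character count, or one whose last line ends mid-paragraph, would be a candidate for the kind of truncation described above.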