python - Losing information when extracting text from PDF using PDFMiner

Question

I'm using Python 3.4 on Windows 7 and hoping I can extract text from PDF files using PDFMiner. However, losing information was quite common when I was testing. For some files, it may be just a matter of a few sentences. But I've encountered situations where half of the text could not be extracted, depending on the file format. Here's my full code:

import io
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams


def convert_pdf(pdfFile, retstr):
    password = ''
    pagenos = set()
    maxpages = 0
    laparams = LAParams()
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    process_pdf(rsrcmgr, device, pdfFile, pagenos, maxpages=maxpages, password=password, check_extractable=True)
    device.close()
    return retstr


def extract_pdf(file_name, language):
    pdfFile = open(file_name, 'rb')
    retstr = io.StringIO()
    retstr = convert_pdf(pdfFile, retstr)
    whole = retstr.getvalue()
    original_texts = whole.split('\n')
    pdfFile.close()
    return original_texts

I wonder if there's a way to extract the full text using PDFMiner. I've heard of poppler, but I can't seem to find how to use it as a Python library. Besides, I don't want to use the command line. Can anyone help?

Here's an example: a thesis. Several paragraphs were lost when extracting using the code above. Like in the 2nd page, I could only extract first half of the page until "Pereira, Tishby, and Lee (1993)" at the middle. Then it just skip right to the next page for no apparent reason.

python - Losing information when extracting text from PDF using PDFMiner

0 回答 0

Related

Reference