I want to extract data from PDF files. I'm using the pdfminer tool pdf2txt to convert a PDF into plain text, but the text file it produces has the data in the wrong order (wherever a table is encountered, and after it as well). I then tried converting the PDF to HTML but, alas, got the same results. I am new to Python, and I couldn't understand the inner workings of the pdfminer library. Is there any way to preserve the order of the data?

1 Answer

Try running the script with these additional args: -M 30 -W .95 -L .03
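For reference, in pdf2txt these flags set pdfminer's layout-analysis margins: -M is the character margin, -W the word margin and -L the line margin. Since you are working from Python, here is a minimal sketch of calling the script with those values. The function names, the file names and the assumption that the script is on your PATH as pdf2txt.py are mine, so adjust them to your setup.

```python
import subprocess


def build_pdf2txt_command(pdf_path, out_path, char_margin=30,
                          word_margin=0.95, line_margin=0.03,
                          script="pdf2txt.py"):
    """Build the argument list for pdf2txt with the layout margins above.

    `script` is an assumption: point it at wherever pdf2txt.py lives.
    """
    return [
        script,
        "-M", str(char_margin),   # character margin
        "-W", str(word_margin),   # word margin
        "-L", str(line_margin),   # line margin
        "-o", out_path,           # output file
        pdf_path,
    ]


def run_pdf2txt(pdf_path, out_path):
    """Run the conversion; raises CalledProcessError if pdf2txt fails."""
    subprocess.run(build_pdf2txt_command(pdf_path, out_path), check=True)
```

Calling run_pdf2txt("input.pdf", "output.txt") would then do the same thing as typing the command by hand. The margin values are just the ones that worked for me, so expect to tune them per document.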

I had the same problem you describe, and these settings improved the output a lot. However, I get much better results with pdftotext.exe, which is part of Xpdf. Download it here:

http://www.foolabs.com/xpdf/download.html
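If you want to drive pdftotext from Python as well, something like the sketch below should work. The -layout option is my suggestion rather than something I tested on your files: it asks pdftotext to keep the physical layout of the page, which may help with tables. The function names and the assumption that the executable is on your PATH as pdftotext are also mine.

```python
import subprocess


def build_pdftotext_command(pdf_path, out_path, keep_layout=True,
                            exe="pdftotext"):
    """Build the argument list for Xpdf's pdftotext.

    `exe` is an assumption: on Windows it may need the full path to
    pdftotext.exe.
    """
    cmd = [exe]
    if keep_layout:
        cmd.append("-layout")  # try to maintain the original page layout
    cmd += [pdf_path, out_path]
    return cmd


def run_pdftotext(pdf_path, out_path, keep_layout=True):
    """Run the conversion; raises CalledProcessError if pdftotext fails."""
    subprocess.run(build_pdftotext_command(pdf_path, out_path, keep_layout),
                   check=True)
```

Try it both with and without keep_layout and compare the text files; which one reads better depends on how the tables in your PDF are laid out.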

Mike

Answered 2012-07-26T00:32:47.987