Here is my PDF. I found THIS and used it to scrape the PDF.

6 BEDROOMS
NameAddressUnitSizeKeyRentSq FtMove in DateNotesTenant
Prop #
Texan 261009 West 26th3076x3$4,6952,1368/15/14$1,000 Bonus (1) Park -     

The output is pretty mixed up. Is that because the PDF is formatted in a way that makes it unreadable? I thought there would be a way to scrape each row and build a CSV from the columns by iterating over them.

For example, populate a CSV with columns like this:

T26 | Texan 26          | 1009 West 26th | 307      | 6x3 | ... 
e075| Texan North Campus| 5117 N Lamar   |See below | 6x3 |...

Is there a way around this?

2 Answers

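The snippet you used produces data that isn't really usable, and I don't think that's the way to go. Scraping from PDFs is usually quite hard, but take a look at pdftables.com: they offer an API for scraping tables from PDF documents, which I have found works in most cases. That's your best chance, I'd say.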

answered 2014-09-17T16:48:34.043

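You can use Camelot (a Python library) to write a script that extracts tabular data from a PDF and exports it as CSV. You can check out the documentation at http://camelot-py.readthedocs.io. It would help if you could post a link to your PDF. Here is a generic code example: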

>>> import camelot
>>> tables = camelot.read_pdf('file.pdf')
>>> type(tables[0].df)
<class 'pandas.core.frame.DataFrame'>
>>> tables[0].to_csv('file.csv')

Disclaimer: I am the author of the library.
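
For the row-by-row iteration the question asks about, here is a minimal sketch using only Python's standard `csv` module. It assumes the rows have already been pulled out of the PDF as lists of strings by some table extractor. The column names and the two rows are taken from the layout in the question and cover only the first five columns, because the rest is elided there. `rows_to_csv` and `rows_to_csv_text` are hypothetical helper names, not part of any library.

```python
import csv
import io

# Column names and rows copied from the layout sketched in the question.
# They stand in for whatever a table extractor actually returns (assumption).
HEADER = ["Prop #", "Name", "Address", "Unit", "Size"]
ROWS = [
    ["T26", "Texan 26", "1009 West 26th", "307", "6x3"],
    ["e075", "Texan North Campus", "5117 N Lamar", "See below", "6x3"],
]


def rows_to_csv(header, rows, out):
    """Write the header, then each row, to the file-like object `out`."""
    writer = csv.writer(out)
    writer.writerow(header)
    for row in rows:
        # Strip padding that extraction tends to leave around cell values.
        writer.writerow([cell.strip() for cell in row])


def rows_to_csv_text(header, rows):
    """Return the CSV as a string, which is handy for quick inspection."""
    buffer = io.StringIO(newline="")
    rows_to_csv(header, rows, buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv_text(HEADER, ROWS))
```

To write a real file instead, open it with `open('file.csv', 'w', newline='')` and pass the file object as `out`.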

answered 2018-11-09T18:49:40.240