python - How to read line by line in pdf file and create a CSV

Question

Here is my PDF. I found THIS and used it to scrape the PDF. This is what I got:

6 BEDROOMS
NameAddressUnitSizeKeyRentSq FtMove in DateNotesTenant
Prop #
Texan 261009 West 26th3076x3$4,6952,1368/15/14$1,000 Bonus (1) Park -

It's pretty mixed up. Or is it because the PDF is formatted in a way that is unreadable? I thought there was a way I could scrape each row and create a CSV with the columns by iterating over the rows or something.

For example, populate a CSV with columns like this:

T26  | Texan 26           | 1009 West 26th | 307       | 6x3 | ...
e075 | Texan North Campus | 5117 N Lamar   | See below | 6x3 | ...

Is there a way around this?

score 0 · Accepted Answer

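The code snippet you used produces data that isn't really usable, and I don't think it is the way to go. Scraping from PDFs is generally quite hard, but take a look at pdftables.com: they offer an API for scraping tables from PDF documents, which I have found works in most cases. I'd say it is your best shot.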

score 0 · Accepted Answer

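You can use Camelot (a Python library) to write a script that extracts tabular data from the PDF and exports it to CSV. You can check out the documentation at http://camelot-py.readthedocs.io. It would be helpful if you could post a link to your PDF. Here is a generic code example: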

>>> import camelot
>>> tables = camelot.read_pdf('file.pdf')
>>> type(tables[0].df)
<class 'pandas.core.frame.DataFrame'>
>>> tables[0].to_csv('file.csv')

Disclaimer: I am the author of the library.
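Whichever extractor ends up producing the rows, the last step the question asks about (populating a CSV column by column) can be done with the standard library. The following is only a minimal sketch, assuming the rows have already been extracted as lists of cell strings. The helper name `rows_to_csv` and the `HEADER` list are hypothetical, and the sample rows are the two shown in the question, cut off at the "Size" column just as they are there.

```python
import csv
import io

# Column names based on the question's header text; only the first five
# columns are used because the question's sample rows stop at "Size".
HEADER = ["Prop #", "Name", "Address", "Unit", "Size"]


def rows_to_csv(rows, header=HEADER):
    """Return CSV text for already-extracted table rows (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    for row in rows:
        # Strip stray whitespace left over from extraction, then pad or
        # trim the row so it has exactly one cell per header column.
        cells = [cell.strip() for cell in row]
        cells += [""] * (len(header) - len(cells))
        writer.writerow(cells[:len(header)])
    return buffer.getvalue()


# Sample rows copied from the question.
rows = [
    ["T26", "Texan 26", "1009 West 26th", "307", "6x3"],
    ["e075", "Texan North Campus", "5117 N Lamar", "See below", "6x3"],
]

csv_text = rows_to_csv(rows)
print(csv_text)
```

To write a file instead of building a string, the same `csv.writer` calls work on a file opened with `open('file.csv', 'w', newline='')`.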
