python - Unable to read a PDF with camelot

Question

I used camelot to read a PDF file, but I only get part of its content.

How can I read all of the pages?

import camelot
import pandas as pd
tables = camelot.read_pdf('data.pdf', pages='all', flavor = 'stream')
df = tables[0].df

The resulting df is:

                                              0            1  \
0                                                               
1   Land Parcel                                   City          
2                                                               
3                                                               
4   Land Parcel No. CTP-1813                      Cangzhou 滄州   
5   .\n.\n.\n.\n.\n.\n.\n.\n.\n.\nCTP-1813 號地塊 .                
6   Land Parcel No. 2018GC22026                   Beihai 北海     
7   .\n.\n.\n.\n.\n.\n.\n2018GC22026 號地塊.                       
8                                                               
9                                                               
10                                                              
11                                                              
12  Land parcels A, B, C and D for                Guigang 貴港    
13  the commercial and residential                              
14  project\nin Station Plaza at                                

                      2          3          4  
0                                   Land       
1   Land Use             Site Area  Premium    
2                                   (RMB       
3                        (sq.m.)    thousand)  
4   Commercial and       97,407.3   759,400    
5   residential                                
6   Wholesale,\nretail,  159,878.4  1,067,260  
7   residential,                               
8   catering,                                  
9   commercial and                             
10  financial and                              
11  residential                                
12  Commercial and       139,600.2  631,870    
13  residential                                
14

I also tried tabula, which returned more results, but still not everything.

score 2 · Accepted Answer

You can try the following code, which uses the table_areas parameter to specify the table boundaries:

tables=camelot.read_pdf("data.pdf", pages='1',flavor='stream',table_areas=['0,800,800,0'])

For more information, see https://camelot-py.readthedocs.io/en/master/user/advanced.html#specify-table-areas
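Note that `tables[0]` in the question is only the first table camelot detected, so the remaining pages can be missing simply because they were never read out of the result. Below is a minimal sketch of combining the two ideas: build the area string, then concatenate every returned table. The helper names `area_string` and `read_all_tables` are made up for illustration, and the coordinates are the ones from the call above, not values measured from your PDF.

```python
def area_string(x1, y1, x2, y2):
    """Format a table area the way camelot expects it: 'x1,y1,x2,y2'.

    (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner,
    in PDF coordinate space.
    """
    return f"{x1},{y1},{x2},{y2}"


def read_all_tables(path, area, pages="all"):
    """Return one DataFrame holding the rows of every table camelot finds."""
    # Imported lazily so area_string can be used without camelot installed.
    import camelot
    import pandas as pd

    tables = camelot.read_pdf(
        path, pages=pages, flavor="stream", table_areas=[area]
    )
    # read_pdf returns a list-like of tables; collect all of them,
    # not just tables[0].
    frames = [table.df for table in tables]
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)
```

Usage would look like `df = read_all_tables("data.pdf", area_string(0, 800, 800, 0))`. Whether a single area fits every page depends on the layout of the PDF, so treat the coordinates as a starting point.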

score 0 · Accepted Answer

I don't know why camelot doesn't work here. Use pdfminer instead. It works for your sample:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

pdf_rm = PDFResourceManager()
with StringIO() as s:
    # TextConverter writes the text of each processed page into s
    with TextConverter(pdf_rm, s, laparams=LAParams()) as d:
        with open('data.pdf', 'rb') as f:
            interpreter = PDFPageInterpreter(pdf_rm, d)
            for page in PDFPage.get_pages(f):
                interpreter.process_page(page)
    # Read the buffer before the outer with block closes it
    text = s.getvalue()

print(text)

Output:

Land Parcel

City

Land Use

Site Area

Land Parcel No. CTP-1813

CTP-1813 號地塊 . . . . . . . . . . .

Land Parcel No. 2018GC22026

2018GC22026 號地塊. . . . . . . .

Land parcels A, B, C and D for the commercial and residential project in Station Plaza at Guigang City 貴港市高鐵站前廣場商住項目 A、B、C及D地塊 . . . . . . . . . . .

Land Parcel No. 201821 and

No. 201822 為201821號及201822 號地塊. . Land Parcel No. QZ(18)049 and

No. QZ(18)050 QZ(18)049號和QZ(18)050號地 塊 . . . . . . . . . . . . . . . . . . . . . . . .

Land Parcel

No. 630102102006GB00321 630102102006GB00321 號地塊 . . . . . . . . . . . . . . . . . . . .

Land Parcel No. Xing Zheng

Chu (2018)45-1 滎政儲(2018)45-1號地塊 . . . . .

Land Parcel

No. XH2018GC012-1, No. XH2018GC012-2 and No. XH2018GC012-3 XH2018GC012-1號、 XH2018GC012-2號和 XH2018GC012-3號地塊. . . . . .

Land Parcel No. 2018-52

2018-52號地塊 . . . . . . . . . . . . .

Land Parcel B No. Yan

J[2018]Z003 of the Xikou Old Residence Renovation 煙J[2018]Z003號西口舊居改造 B地塊. . . . . . . . . . . . . . . . . . . . .

of Guihuang Road in Chengxin District 靈川縣城新區桂黃公路東側地 塊 . . . . . . . . . . . . . . . . . . . . . . . .

Land Parcel No. BS18-1J-307

BS18-1J-307號地塊 . . . . . . . . .

Land Parcel No. Jing Tu Zheng

Chu Gua (Shun) [2018]043 京土整儲掛(順)[2018]043號地 塊 . . . . . . . . . . . . . . . . . . . . . . . .

Land

Premium

(RMB

thousand) 759,400

Cangzhou 滄州 Commercial and

(sq.m.) 97,407.3

Beihai 北海

residential

Wholesale, retail,

159,878.4

1,067,260

residential, catering, commercial and financial and residential

Guigang 貴港 Commercial and

residential

139,600.2

631,870

Yancheng 鹽城 Commercial and

167,738.0

339,400

residential

Guiyang 貴陽 Commercial and

117,023.0

342,050

residential

Xining 西寧

Commercial and

77,075.5

404,635

residential

Xingyang 滎陽 Commercial

72,351.7

260,400

Taizhou 泰州

Commercial and

217,681.3

728,520

residential

Xuzhou 徐州

Residential

74,448.6

1,203,000

Yantai 煙臺

Residential,

107,015.1

205,776

commercial service, public management and public service

Commercial and

63,442.7

62,820

residential

Chongqing 重慶 Residential

136,246.3

238,700

Beijing 北京

Class-2

69,856.0

2,330,000

residential, institutional pension facilities and basic educational

– 4 –

Land Parcel located to the east

Guilin 桂林
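The pdfminer output above is plain text rather than a table, and many lines end in dot leaders (". . . . ."). If that gets in the way, a small cleanup pass along these lines may help. This is only a sketch: `strip_dot_leaders`, `clean_text`, and `extract_lines` are hypothetical helper names, the regex only removes leaders at the end of a line, and `extract_lines` assumes the pdfminer.six package, which provides `pdfminer.high_level.extract_text`.

```python
import re

# Two or more dots, each optionally preceded by whitespace, at the end of
# a line, e.g. " . . . . ." or ". . . ."
_DOT_LEADER = re.compile(r"(?:\s*\.){2,}\s*$")


def strip_dot_leaders(line):
    """Remove a trailing run of dot leaders from a single line."""
    return _DOT_LEADER.sub("", line).rstrip()


def clean_text(text):
    """Strip trailing dot leaders and drop lines that end up empty."""
    cleaned = (strip_dot_leaders(line) for line in text.splitlines())
    return [line for line in cleaned if line]


def extract_lines(path):
    """Extract text from a PDF and return the cleaned lines."""
    # Imported lazily so the helpers above work without pdfminer installed.
    from pdfminer.high_level import extract_text

    return clean_text(extract_text(path))
```

Lines such as "Land Parcel No. CTP-1813" are left alone, because the single dot in "No." is not a run that reaches the end of the line. Rebuilding the rows and columns of the table from these lines is a separate step that this sketch does not attempt.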
