I have scraped this PDF with Tabula using the following code, which, with multiple_tables=True, produced a list of 1410 tables:
from tabula import read_pdf

df = read_pdf("~/Google Drive/DATA/978-1-912036-41-7-Who Owns Whom UK-Ireland-Volume-1.pdf",
              stream=True, pages='19-1428', guess=False, pandas_options={'header': None},
              encoding='ISO-8859-1', multiple_tables=True, columns=[210, 400],
              area=[60, 30, 835, 1000])
For example, the first table:
df[0000]
0 ... 2
0 NaN ... . Contrarian Group Limited England
1 NaN ... . Cornerstone Study Abroad Limited England
2 ? ... . Crudolife Limited England
3 NaN ... . Crystal Palace Physio Group Limited England
4 NaN ... . Daniels London Limited England
.. ... ... ...
140 . . Kharis Catering C.I.C. England ... . Pivotal Technologies Limited England
141 . . Kvm Limited England ... . Plus Black Limited England
142 . . London College Limited England ... . Plus Tyres Limited England
143 . . London College of Accounting & Finance Lim... ... . Portaplay Limited England
144 . .Millfield & Partners Ltd England ... . Portaplaypen Limited England
[145 rows x 3 columns]
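Question
How can I first concatenate the three columns of each table (one on top of the other) to get a single column, and then concatenate the 1410 tables into one table?
I managed to loop over the list of tables and print each one as a single column, but I cannot get the result into a DataFrame: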
import numpy as np
import pandas as pd

for x, res in enumerate(df):
    print(np.ravel(res)[None].T)
I tried this:
for x, res in enumerate(df):
    v = np.ravel(res)[None].T
    result = pd.DataFrame(x,":",v,columns=['t'])
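For reference, the operation described above (stack the three columns of each table, then stack all the tables) could be sketched roughly as follows. This is only an illustration under assumptions: `tables` is a small hypothetical stand-in for the list returned by read_pdf, `stack_columns` is a made-up helper name, and dropping the NaN cells at the end is optional. Note that np.ravel flattens row by row by default, so order="F" is used here to go column by column instead.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the list of tables returned by read_pdf(..., multiple_tables=True)
tables = [
    pd.DataFrame([["a1", "b1", "c1"], ["a2", np.nan, "c2"]]),
    pd.DataFrame([["d1", "e1", "f1"]]),
]


def stack_columns(table):
    """Return one Series with the table's columns placed one on top of another."""
    # order="F" flattens column by column: all of column 0, then column 1, then column 2
    return pd.Series(table.to_numpy().ravel(order="F"))


# Stack the columns of each table, then concatenate every table into one single-column DataFrame
result = pd.concat([stack_columns(t) for t in tables], ignore_index=True).to_frame("t")

# Optional: drop the empty (NaN) cells and renumber the rows
result = result.dropna().reset_index(drop=True)

print(result)
```

With the real data, `tables` would simply be the `df` list from above. Building all the pieces first and calling pd.concat once avoids overwriting `result` on every loop iteration, which is what happens in the attempt above.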