I scraped this PDF with Tabula using the code below, which (with multiple_tables=True) produced a list of 1410 tables:

from tabula import read_pdf
df = read_pdf("~/Google Drive/DATA/978-1-912036-41-7-Who Owns Whom UK-Ireland-Volume-1.pdf",
              stream=True, pages='19-1428', guess=False, pandas_options={'header': None},  encoding = 'ISO-8859-1', 
              multiple_tables=True, columns=[210, 400], area=[60, 30, 835, 1000])

Sample of the first table:

df[0000]
                                                     0  ...                                              2
0                                                  NaN  ...             . Contrarian Group Limited England
1                                                  NaN  ...     . Cornerstone Study Abroad Limited England
2                                                    ?  ...                    . Crudolife Limited England
3                                                  NaN  ...  . Crystal Palace Physio Group Limited England
4                                                  NaN  ...               . Daniels London Limited England
..                                                 ...  ...                                            ...
140                 . . Kharis Catering C.I.C. England  ...         . Pivotal Technologies Limited England
141                            . . Kvm Limited England  ...                   . Plus Black Limited England
142                 . . London College Limited England  ...                   . Plus Tyres Limited England
143  . . London College of Accounting & Finance Lim...  ...                    . Portaplay Limited England
144                . .Millfield & Partners Ltd England  ...                 . Portaplaypen Limited England
[145 rows x 3 columns]

Question

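How can I first stack the three columns of each table on top of one another to get a single column, and then concatenate the 1410 tables into one table?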

I managed to loop over the list of tables and print each one as a single column, but I can't get the result into a DataFrame:

import numpy as np
import pandas as pd
for x, res in enumerate(df):
    print(np.ravel(res)[None].T)

I tried this:

for x, res in enumerate(df):
    v = np.ravel(res)[None].T
    result = pd.DataFrame(x,":",v,columns=['t'])
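
For reference, the operation being asked about could look roughly like the sketch below. It is a minimal illustration under stated assumptions, not a tested solution for this PDF: `tables` is a small made-up stand-in for the list returned by `read_pdf`, `stack_columns` is a hypothetical helper name, and dropping the NaN cells is assumed to be wanted. Note that `np.ravel(res)` flattens row by row by default, which interleaves the columns; `order="F"` flattens column by column instead.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the list returned by read_pdf(..., multiple_tables=True)
tables = [
    pd.DataFrame([["a1", "b1", "c1"], [np.nan, "b2", "c2"]]),
    pd.DataFrame([["d1", "e1", "f1"]]),
]


def stack_columns(table):
    """Stack the columns of one table on top of each other (column 0, then 1, then 2)."""
    # order="F" walks the array column by column instead of row by row
    values = table.to_numpy().ravel(order="F")
    return pd.Series(values, dtype="object")


# Stack each table, then concatenate all stacked tables into one column named 't'
result = (
    pd.concat([stack_columns(t) for t in tables], ignore_index=True)
    .dropna()                  # assumption: empty cells are not needed
    .reset_index(drop=True)
    .to_frame(name="t")
)

print(result)
```

With the real data, `tables` would be replaced by the `df` list from `read_pdf`; whether the NaN rows should be dropped depends on what the empty cells mean in the source.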