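I am using camelot to extract a table from a PDF document. The table has Transaction Date, Description, Debit, Credit, and Balance fields. The Description field sometimes holds long entries that wrap onto the following lines. When I use camelot, it prints the rows like this: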
Transaction Date Description Debit Credit Balance
2 01/11/2020 BAL B/F 38,485.30 38,485.30
3 02/11/2020 20,000.00 18,485.30
4 MB X WITHDRAWAL
5 Ref. MP:V TO X NO:5MP:V TO
6 X NO:9
7 04/11/2020 MB X WITHDRAWAL 20,000.00 98,485.30
8 Ref. MP:V TO X NO:40MP:V TO
9 X NO:47
10 05/11/2020 MB X WITHDRAWAL 80,000.00 18,485.30
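I want the table to come out so that when a Description entry wraps onto the following lines, those lines are combined into a single row, for example: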
Transaction Date Description Debit Credit Balance
2 01/11/2020 BAL B/F 38,485.30 38,485.30
3 02/11/2020 MB X WITHDRAWAL Ref. MP:V TO X NO:5MP:V TO X NO:9 20,000.00 18,485.30
Here is my code:
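import camelot

# Stream flavor infers columns from whitespace; edge_tol widens the detected text edges
tables = camelot.read_pdf('D:\\test.pdf', flavor='stream', edge_tol=500)
print(tables[0].df)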
How can I achieve this?
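One possible approach is to post-process the extracted DataFrame with pandas rather than change the camelot call. The sketch below treats every row with a non-empty date as the start of a new transaction and folds the rows that follow it (those with a blank date) into that transaction. The function name `merge_continuation_rows` and its parameters `key_col` and `text_col` are hypothetical, introduced only for illustration. It also assumes that continuation rows always have an empty date cell, which matches the output shown above but may not hold for every PDF.

```python
import pandas as pd


def merge_continuation_rows(df, key_col, text_col):
    """Fold rows with a blank key_col into the nearest preceding row that has one.

    key_col and text_col are hypothetical parameter names: the column that marks
    the start of a record (the date) and the column whose text wraps (the description).
    """
    df = df.fillna("").astype(str)

    # A new record starts wherever the key column is non-empty.
    starts = df[key_col].str.strip().ne("")
    # Running count of record starts: every continuation row shares the id
    # of the dated row above it.
    record_id = starts.cumsum().rename("record_id")

    def first_non_empty(values):
        # Keep the first non-blank cell of the group (date, amounts, balance).
        for value in values:
            if value.strip():
                return value.strip()
        return ""

    def join_non_empty(values):
        # Concatenate the wrapped text fragments in their original order.
        return " ".join(v.strip() for v in values if v.strip())

    aggregations = {col: first_non_empty for col in df.columns}
    aggregations[text_col] = join_non_empty

    return df.groupby(record_id).agg(aggregations).reset_index(drop=True)
```

If the header has been promoted to column names, this could be called as `merge_continuation_rows(tables[0].df, "Transaction Date", "Description")`. If camelot leaves integer column labels with the header as an ordinary data row, pass the corresponding integer labels instead. In that case the header row has a non-empty key cell, so it stays as its own row.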