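Can anyone suggest how to extract tabular data from a PDF with a Python/Java program, so that I can get the following borderless table that appears in the PDF file?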
Viewed 1924 times
3 Answers
2
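This table may be difficult for tabula. How about using guess=False, stream=True?
Update: As of tabula-py 1.0.3, guess and stream should work together. There is no need to set guess=False in order to use the stream or lattice option.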
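A minimal sketch of that suggestion, assuming tabula-py and a Java runtime are installed. The helper names `stream_options` and `read_borderless_tables` are hypothetical and only wrap `tabula.read_pdf`; the keyword arguments are the ones discussed above.

```python
def stream_options(pages="all", guess=True):
    """Build keyword arguments for tabula.read_pdf in stream mode.

    guess is left at True here because, per the update above, it is
    expected to work together with stream from tabula-py 1.0.3 on.
    """
    return {"stream": True, "lattice": False, "guess": guess, "pages": pages}


def read_borderless_tables(pdf_path, **overrides):
    """Read tables from pdf_path using whitespace-based (stream) detection."""
    # Imported lazily so the option helper works without tabula-py installed.
    import tabula

    options = stream_options()
    options.update(overrides)  # e.g. guess=False, area=[top, left, bottom, right]
    return tabula.read_pdf(pdf_path, **options)
```

If detection still picks the wrong region, passing `guess=False` together with an explicit `area` as an override is one thing to try.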
answered 2018-08-08T07:18:11.890
0
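Borderless table extraction with tabula-py:
tabula-py has a stream=True option, which detects tables based on the gaps (whitespace) between columns.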
from tabula import convert_into
src_pdf = r"src_path"
des_csv = r"des_path"
convert_into(src_pdf, des_csv, guess=False, lattice=False, stream=True, pages="all")
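To give a rough feel for what "detecting a table from gaps" means, here is a toy illustration on plain text lines. It is not tabula's actual algorithm, only a sketch of the idea under a simplifying assumption: a column boundary is any run of at least `min_gap` character positions that is blank in every line. The function names and sample data are made up for the example.

```python
def find_column_gaps(lines, min_gap=2):
    """Return (start, end) character ranges that are blank in every line."""
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    blank = [all(row[i] == " " for row in padded) for i in range(width)]
    gaps, start = [], None
    # The trailing False acts as a sentinel that closes a gap still open at the end.
    for i, is_blank in enumerate(blank + [False]):
        if is_blank and start is None:
            start = i
        elif not is_blank and start is not None:
            if i - start >= min_gap:
                gaps.append((start, i))
            start = None
    return gaps


def split_by_gaps(lines, min_gap=2):
    """Split each line into cells at the shared whitespace gaps."""
    width = max(len(line) for line in lines)
    edges = [0] + [pos for gap in find_column_gaps(lines, min_gap) for pos in gap] + [width]
    # Pair the edges up as (column start, column end) and drop empty spans.
    spans = [(a, b) for a, b in zip(edges[0::2], edges[1::2]) if b > a]
    return [[line.ljust(width)[a:b].strip() for a, b in spans] for line in lines]


# Made-up sample rows laid out in fixed-width columns, like text in a borderless table.
sample = [
    "Aviator".ljust(11) + "Monitoring web services".ljust(28) + "https://example.org/a",
    "ModFOLD8".ljust(11) + "Quality estimates".ljust(28) + "https://example.org/b",
]
table = split_by_gaps(sample)
```

Single spaces inside a cell do not split it, because a gap has to be blank in every line and at least `min_gap` characters wide.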
answered 2021-11-27T10:37:22.020
0
I installed tabula-py with
conda install tabula-py
and ran
>>> import tabula
>>> area = [70, 30, 750, 570]  # [top, left, bottom, right] in points; seems to have to be set manually
>>> page2 = tabula.read_pdf("nar_2021_editorial-2.pdf", guess=False, lattice=False,
...     stream=True, multiple_tables=False, area=area, pages="all",
... )  # the `tabula` docs explain the params very well
>>> page2
I got this result:
> 'pages' argument isn't specified.Will extract only from page 1 by default.
> [
>     ShortTitle      Text  \
> 0   Arena3Dweb      3D visualisation of multilayered networks
> 1   Aviator         Monitoring the availability of web services
> 2   b2bTools        Predictions for protein biophysical features and
> 3   NaN             their conservation
> 4   BENZ WS         Four-level Enzyme Commission (EC) number
> ..  ...             ...
> 68  miRTargetLink2  miRNA target gene and target pathway
> 69  NaN             networks
> 70  mmCSM-PPI       Effects of multiple point mutations on
> 71  NaN             protein-protein interactions
> 72  ModFOLD8        Quality estimates for 3D protein models
>
>     URL
> 0   http://bib.fleming.gr/Arena3D
> 1   https://www.ccb.uni-saarland.de/aviator
> 2   https://bio2byte.be/b2btools/
> 3   NaN
> 4   https://benzdb.biocomp.unibo.it/
> ..  ...
> 68  https://www.ccb.uni-saarland.de/mirtargetlink2
> 69  NaN
> 70  http://biosig.unimelb.edu.au/mmcsm ppi
> 71  NaN
> 72  https://www.reading.ac.uk/bioinf/ModFOLD/
>
> [73 rows x 3 columns]]
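The result is an iterable object (judging by the brackets in the output, a list holding one DataFrame), so you can loop over it with for row in page2:
Hope this helps.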
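In the output above, cells that wrap onto a second line in the PDF come back as separate rows with NaN in ShortTitle (rows 3, 69 and 71). One possible cleanup, sketched here in plain Python, folds each such row into the row before it. It assumes that a missing ShortTitle always marks a continuation line, which holds for the rows shown but is not guaranteed in general. The function names are hypothetical; with pandas, the input records could come from something like `page2[0].to_dict("records")`.

```python
import math


def is_missing(value):
    """True for None or a float NaN (the marker pandas uses for empty cells)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def merge_wrapped_rows(records, key):
    """Fold rows whose `key` cell is missing into the previous row."""
    merged = []
    for record in records:
        if is_missing(record.get(key)) and merged:
            previous = merged[-1]
            for column, value in record.items():
                if column == key or is_missing(value):
                    continue
                if is_missing(previous.get(column)):
                    previous[column] = value
                else:
                    # Join the wrapped text back onto the previous line.
                    previous[column] = f"{previous[column]} {value}"
        else:
            merged.append(dict(record))  # copy so the input is not modified
    return merged


# Rows 2, 3 and 72 from the output above, written out as records.
rows = [
    {"ShortTitle": "b2bTools",
     "Text": "Predictions for protein biophysical features and",
     "URL": "https://bio2byte.be/b2btools/"},
    {"ShortTitle": float("nan"), "Text": "their conservation", "URL": float("nan")},
    {"ShortTitle": "ModFOLD8",
     "Text": "Quality estimates for 3D protein models",
     "URL": "https://www.reading.ac.uk/bioinf/ModFOLD/"},
]
cleaned = merge_wrapped_rows(rows, key="ShortTitle")
```

Here `cleaned` has two records, and the b2bTools entry carries the full description in its Text field.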
answered 2021-07-06T10:54:19.953