
I have a database of PDF files that I downloaded through web scraping. I can extract the tables from these PDF files and display them in a Jupyter notebook as follows:

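import os
import camelot.io as camelot

pdf_dir = r'D:\Test'  # folder containing the PDF files (raw string keeps the backslash literal)

# enumerate numbers the data sheets starting at 1
for n, item in enumerate(os.listdir(pdf_dir), start=1):
    # os.listdir returns bare file names, so build the full path before reading
    tables = camelot.read_pdf(os.path.join(pdf_dir, item), pages='all', split_text=True)
    print(f'''DATENBLATT {n}: {item}

    ''')
    for tabs in tables:
        print(tabs.df, "\n==============================================================================\n")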

This gives me the following results for the two PDF files in my database.


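Now I would like to know how to get specific data from the tables that contain, for example, "Voltage" and "Current" information. More specifically, I want to extract user-defined target values and plot them in a chart, rather than printing the tables in full.

Thanks in advance.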

DATENBLATT 1: HY-Energy-Plus-Peak-Pack-HYP-00-2972-R2.pdf

    
                         0                                                  1
0              Part Number                                        HYP-00-2972
1          Voltage Nominal                                              51.8V
2    Voltage Range Min/Max                                        43.4V/58.1V
3           Charge Current  160A maximum \nDe-rated by BMS message over CA...
4        Discharge Current  300A maximum \nDe-rated by BMS message over CA...
5         Maximum Capacity                                    5.76kWh/111.4Ah
6   Maximum Energy Density                                           164Wh/kg
7         Useable capacity         Limited to 90% by BMS to improve cell life
8               Dimensions                       W: 243 x L: 352 x H: 300.5mm
9                   Weight                                               37kg
10       Mounting Fixtures     4x M8 mounting points for easy secure mounting
11                                                                            
==============================================================================

                                 0  \
0           Communication Protocol   
1             Reported Information   
2        Pack Protection Mechanism   
3                 Balancing Method   
4             Multi-Pack Behaviour   
5  Compatible Chargers as standard   
6                  Charger Control   
7             Auxiliary Connectors   
8                 Power connectors   
9                                    

                                                   1  
0  CAN bus at user selectable baud rate (propriet...  
1  Cell Temperatures and Voltages, Pack Current, ...  
2  Interlock to control external protection devic...  
3          Actively controlled dissipative balancing  
4  BMS implements a single master and multi-slave...  
5  Zivan, Victron, Delta-Q, TC-Charger, SPE. For ...  
6  Direct current control based on cell voltage/t...  
7              Binder 720-Series 8-way male & female  
8  4x Amphenol SurLok Plus 8mm \nWhen using batte...  
9                                                      
==============================================================================

                              0  \
0     Max no of packs in series   
1  Max Number of Parallel Packs   
2  External System Requirements   
3                                 

                                                   1  
0                                                 10  
1                                                127  
2  External Protection Device (e.g. Contactor) co...  
3                                                      
==============================================================================

DATENBLATT 2: HY-Energy-Standard-Pack-HYP-00-2889-R2.pdf

    
                         0                                                  1
0              Part Number                                        HYP-00-2889
1          Voltage Nominal                                              44.4V
2    Voltage Range Min/Max                                        37.2V/49.8V
3           Charge Current  132A maximum \nDe-rated by BMS message over CA...
4        Discharge Current  132A maximum \nDe-rated by BMS message over CA...
5         Maximum Capacity                                      4.94kWh/111Ah
6   Maximum Energy Density                                           152Wh/kg
7         Useable capacity         Limited to 90% by BMS to improve cell life
8               Dimensions                         W: 243 x L: 352 x H: 265mm
9                   Weight                                               32kg
10       Mounting Fixtures     4x M8 mounting points for easy secure mounting
11                                                                            
==============================================================================

                                 0  \
0           Communication Protocol   
1             Reported Information   
2        Pack Protection Mechanism   
3                 Balancing Method   
4             Multi-Pack Behaviour   
5  Compatible Chargers as standard   
6                  Charger Control   
7             Auxiliary Connectors   
8                 Power connectors   
9                                    

                                                   1  
0  CAN bus at user selectable baud rate (propriet...  
1  Cell Temperatures and Voltages, Pack Current, ...  
2  Interlock to control external protection devic...  
3          Actively controlled dissipative balancing  
4  BMS implements a single master and multi-slave...  
5  Zivan, Delta-Q, TC-Charger, SPE, Victron, Bass...  
6  Direct current control based on cell voltage/t...  
7              Binder 720-Series 8-way male & female  
8  4x Amphenol SurLok Plus 8mm \nWhen using batte...  
9                                                      
==============================================================================

                              0  \
0     Max no of packs in series   
1  Max Number of Parallel Packs   
2  External System Requirements   
3                                 

                                                   1  
0                                                 12  
1                                                127  
2  External Protection Device (e.g. Contactor) co...  
3                                                      
==============================================================================

1 Answer


You can define a list of strings of interest,

and then select only the tables that contain at least one of these strings.

import os
import camelot.io as camelot

# define your strings of interest (lowercase, because the table text is lowercased below)
interesting_strings = ["voltage", "current"]

pdf_dir = r'D:\Test'  # folder containing the PDF files (raw string keeps the backslash literal)

for n, item in enumerate(os.listdir(pdf_dir), start=1):
    # os.listdir returns bare file names, so build the full path before reading
    tables = camelot.read_pdf(os.path.join(pdf_dir, item), pages='all', split_text=True)
    print(f'''DATENBLATT {n}: {item}

    ''')
    for tabs in tables:
        # select only tables which contain at least one of the interesting strings
        if any(s in tabs.df.to_string().lower() for s in interesting_strings):
            print(tabs.df, "\n==============================================================================\n")
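
If you would rather keep the matching tables for later processing instead of printing them, the same check can be wrapped in a small helper. This is only a sketch: `filter_tables` and the two sample frames are names made up for illustration, and the frames merely stand in for the `tabs.df` objects that camelot returns. It joins the cell text directly instead of going through `to_string()`, so the match does not depend on how pandas formats the table.

```python
import pandas as pd

def filter_tables(frames, keywords):
    """Return the DataFrames that mention at least one keyword (case-insensitive)."""
    wanted = [k.lower() for k in keywords]
    selected = []
    for df in frames:
        # Join every cell into one lowercase string and search it
        text = " ".join(df.astype(str).to_numpy().ravel()).lower()
        if any(k in text for k in wanted):
            selected.append(df)
    return selected

# Hypothetical stand-ins for tabs.df, built from a few rows of the output above
spec = pd.DataFrame([["Voltage Nominal", "51.8V"], ["Weight", "37kg"]])
packs = pd.DataFrame([["Max no of packs in series", "10"]])

hits = filter_tables([spec, packs], ["voltage", "current"])
print(len(hits))
```

With camelot you would pass `[t.df for t in tables]` as `frames`.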

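If you want to search for the strings of interest only in a specific place (for example, in the first column), you can use pandas DataFrame indexers such as iloc. Note that iloc[:, 0] selects the first column, whereas iloc[0] selects the first row:

any(s in tabs.df.iloc[:, 0].to_string().lower() for s in interesting_strings)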
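
To get from matching rows to numbers you can plot, the value cells still have to be parsed, since they hold text such as "51.8V". A rough sketch of one way to do that with the standard library follows. `extract_values` is a hypothetical helper, the rows are copied from the first data sheet above, and I am assuming each row is a two-cell [label, value] list (which `tabs.df.values.tolist()` should give you for the two-column tables shown). It keeps only the first number in each cell, so a range like "43.4V/58.1V" yields just the minimum.

```python
import re

# Matches an unsigned integer or decimal number such as 160 or 51.8
NUMBER = re.compile(r"\d+(?:\.\d+)?")

def extract_values(rows, keywords):
    """Map each matching label to the first number found in its value cell."""
    result = {}
    for label, value in rows:
        if any(k.lower() in label.lower() for k in keywords):
            match = NUMBER.search(value)
            if match:  # skip cells that contain no number at all
                result[label] = float(match.group())
    return result

# Sample rows taken from the first data sheet in the question
rows = [
    ["Part Number", "HYP-00-2972"],
    ["Voltage Nominal", "51.8V"],
    ["Voltage Range Min/Max", "43.4V/58.1V"],
    ["Charge Current", "160A maximum"],
    ["Weight", "37kg"],
]

values = extract_values(rows, ["voltage", "current"])
print(values)
```

The resulting dictionary of labels and floats can then be handed to whatever plotting library you use, for example as the categories and heights of a bar chart.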
answered 2021-05-11T08:33:36.770