python - How to loop through a list of dataframes and drop all data if a specific string isn't found

Question

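I'm using the Python library Camelot to parse multiple PDFs and extract all the tables in those PDF files. The first line of code returns all the tables scraped from the PDF as a list. I'm looking for one table in particular that contains a unique string. Thankfully, this string is unique to that table, so in theory I can use it to isolate the table I want to grab.

The PDFs are created in more or less the same format, but there is enough variation that I can't just make a static call for the table I want. For example, sometimes the table I want is the first one scraped, and sometimes it's the third. So I need to write some code that selects the table dynamically.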

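The workflow I have in mind goes like this:

Create an empty list before the for loop to append the tables to. Run the for loop and iterate over every table in the list that Camelot outputs. If a table doesn't contain the string I'm looking for, delete all the data in that table and then append the empty dataframe to the list. If it does contain the string, append it to the list without deleting anything.

Is there a better way to approach this? I'm sure there probably is.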

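I've put what I have so far in the code below. I'm struggling to put together a conditional statement that drops all the rows of a dataframe if the string isn't present. I've found plenty of examples of dropping columns and rows when a string is present, but none for an entire dataframe.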

import camelot
import pandas as pd

#this creates a list of all the tables that Camelot scrapes from the pdf
tables = camelot.read_pdf('pdffile', flavor ='stream', pages = '1-end')

#empty list to append the tables to
elist = []

for t in tables:
    dftemp = t.df

    #my attempt at dropping all the value if the unique value isnt found. THIS DOESNT WORK
    dftemp[dftemp.values  != "Unique Value", dftemp.iloc[0:0]]

    #append to the list
    elist.append(dftemp)

#combine all the dataframes in the list into one dataframe
dfcombined = pd.concat(elist)

score 3 · Accepted Answer

You can use the "in" operator on the numpy array returned by dftemp.values:

for t in tables:
    dftemp = t.df

    #keep the table only if it contains the unique string
    if "Unique Value" in dftemp.values:
        #append to the list
        elist.append(dftemp)
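
One caveat worth checking: "in" on the values array is an exact, whole-cell comparison, so a cell that merely contains the string as part of longer text will not match. Below is a minimal sketch of both behaviours. The sample frames are hypothetical stand-ins for the t.df objects Camelot would return, and contains_text is an illustrative helper, not part of Camelot or pandas.

```python
import pandas as pd

# Hypothetical stand-ins for the t.df frames Camelot would return.
frames = [
    pd.DataFrame([["a", "b"], ["c", "d"]]),
    pd.DataFrame([["Unique Value", "x"], ["y", "z"]]),
    pd.DataFrame([["has Unique Value inside", "q"]]),
]

# Exact match: True only when some cell equals the string.
exact = ["Unique Value" in df.values for df in frames]


def contains_text(df, needle):
    """Illustrative helper: True if any cell contains needle as a substring."""
    as_text = df.astype(str)
    return any(
        as_text[col].str.contains(needle, regex=False).any()
        for col in as_text.columns
    )


# Substring match: also catches cells with extra text around the string.
partial = [contains_text(df, "Unique Value") for df in frames]
```

If the PDFs sometimes add whitespace or surrounding text to the cell, the substring check is likely the safer of the two.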

score 1 · Accepted Answer

You can do this in one line:

dfcombined = pd.concat([t.df if "Unique Value" in t.df.values else pd.DataFrame() for t in tables ])
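
A possible variant, sketched here under assumptions: rather than substituting an empty dataframe for each non-matching table, filter those tables out before concatenating. The function name combine_matching and the sample frames are hypothetical. The guard is there because pd.concat raises a ValueError when given an empty list.

```python
import pandas as pd


def combine_matching(frames, needle):
    """Concatenate only the frames that hold needle as an exact cell value."""
    matches = [df for df in frames if needle in df.values]
    # pd.concat raises ValueError on an empty list, so return an empty frame.
    if not matches:
        return pd.DataFrame()
    return pd.concat(matches, ignore_index=True)


# Hypothetical stand-ins for [t.df for t in tables].
frames = [
    pd.DataFrame([["a", "b"]]),
    pd.DataFrame([["Unique Value", "x"], ["y", "z"]]),
]
combined = combine_matching(frames, "Unique Value")
```

With real Camelot output, the call would presumably look like combine_matching([t.df for t in tables], "Unique Value").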
