python - try/except IndexError - I am not getting the desired result

Question

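I am trying to read PDF files and convert them into clean data frames in Python. I loop over all relevant pages and want to append the data frames step by step to get one large table with all the information.

Pages 32-33 need slightly different handling from the other pages (otherwise an IndexError is raised). I tried to solve this with try-except. However, after running the code, the information from pages 32-33 is missing from the resulting data frame ledig['2000'].

I tried executing the code in the except block on its own and it works (if I read only pp. 32-33).

Any ideas?

Since this is my first time using try-except, it is of course possible that I have misunderstood the concept somehow.

My code: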

import camelot
import pandas as pd
ledig = {}
d = 2000
df_name = str(d)
tables = camelot.read_pdf('https://www.estv.admin.ch/dam/estv/de/dokumente/allgemein/Dokumentation/Zahlen_fakten/Steuerstatistiken/steuerbelastung_gemeinden/'+str(d)+'/BAE/Bruttoarbeitseinkommen%20Lediger.pdf.download.pdf/'+str(d)+'_bruttoarbeit_lediger_'+str(d)+'.pdf', pages="2-end", flavor='stream')
j = tables.n - 1
ledig[df_name] = pd.DataFrame()
for i in range(0,j):
    try:
        row = tables[i].df[tables[i].df.iloc[:,1] == '20'].index.tolist() #look for value "20", we want to move that to the top and delete rows above
        df = tables[i].df[row[0]:]
        new_header = df.iloc[0] #grab the first row for the header
        df = df[1:] #take the data less the header row
        df.columns = new_header #set the header row as the df header
        df = df.replace('-','0')
        df.iloc[:, 1:] = df.iloc[:, 1:].apply(pd.to_numeric)  
        ledig[df_name] = pd.concat([ledig[df_name], df]) # DataFrame.append was removed in pandas 2.0
        ledig[df_name] = ledig[df_name].dropna()
        ledig[df_name].drop_duplicates(keep=False,inplace=True) 
    except IndexError:
        row = tables[i].df[tables[i].df.iloc[:,2] == '20'].index.tolist() #look for value "20", we want to move that to the top and delete rows above
        df = tables[i].df[row[0]:]
        df = df.drop(df.columns[[1,3]], axis=1) 
        new_header = df.iloc[0] #grab the first row for the header
        df = df[1:] #take the data less the header row
        df.columns = new_header #set the header row as the df header
        df = df.replace('-','0')
        df.iloc[:, 1:] = df.iloc[:, 1:].apply(pd.to_numeric)  
        df.fillna(0, inplace = True)  
        ledig[df_name] = pd.concat([ledig[df_name], df]) # DataFrame.append was removed in pandas 2.0
        ledig[df_name] = ledig[df_name].dropna()
        ledig[df_name].drop_duplicates(keep=False,inplace=True)

score 1 · Accepted Answer

Your use of try/except is correct.
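To illustrate why the pattern itself is sound, here is a minimal sketch of the same fallback on toy data. The helper name find_header_row and the two small frames are hypothetical and only mimic the layout described in the question: when the value "20" is not found in column 1, the list of matching rows is empty, indexing it raises IndexError, and the except branch looks in column 2 instead.

```python
import pandas as pd

def find_header_row(df, value="20"):
    """Return (row index, column position) of the first cell equal to value.

    Column 1 is tried first; column 2 is the fallback (hypothetical helper).
    """
    try:
        rows = df[df.iloc[:, 1] == value].index.tolist()
        return rows[0], 1  # raises IndexError when rows is empty
    except IndexError:
        rows = df[df.iloc[:, 2] == value].index.tolist()
        return rows[0], 2

# Toy table in the regular layout: "20" sits in column 1.
normal = pd.DataFrame([["x", "a", "b"], ["Gemeinde", "20", "30"], ["Bern", "1", "2"]])

# Toy table in the shifted layout: column 1 is empty, "20" sits in column 2.
shifted = pd.DataFrame([["x", "", "a"], ["Gemeinde", "", "20"], ["Bern", "", "1"]])

print(find_header_row(normal))
print(find_header_row(shifted))
```

So the except branch does run for the shifted pages; the missing rows have to come from what happens inside it.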

The problem is in df = df.drop(df.columns[[1,3]], axis=1): you should not drop the 4th column (index 3).

If you use df = df.drop(df.columns[[1]], axis=1) instead, the tables from pages 32 and 33 are appended correctly.
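One plausible mechanism for the rows vanishing silently, sketched on made-up data: if a table lacks a column that the accumulated frame has, concatenation fills that column with NaN for the new rows, and the subsequent dropna() removes them. The column labels and values below are hypothetical, and pd.concat is used in place of DataFrame.append, which was removed in pandas 2.0.

```python
import pandas as pd

# Hypothetical stand-in for the tables already collected from the regular pages.
combined = pd.DataFrame({"Gemeinde": ["Bern"], "20": [1.0], "30": [2.0]})

# Hypothetical page-32 table after dropping one column too many: "30" is gone.
page_32 = pd.DataFrame({"Gemeinde": ["Chur"], "20": [3.0]})

# Columns are aligned by label, so the Chur row gets NaN in column "30".
merged = pd.concat([combined, page_32], ignore_index=True)
print(merged)

# dropna() then discards every row that contains a NaN, including the Chur row.
cleaned = merged.dropna()
print(cleaned)
```

Printing the column labels of each df before appending is a quick way to check whether this is what happens with the real tables.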
