I'm trying to extract monthly data from tables in PDF files. I wrote code that works well, but it fails for certain months. This is what I did:

def provincias(codigo,pages:list):
    box = [4,2.8,19,27]
    fc = 28.28
    for i in range(0, len(box)):
      box[i] *= fc

    path = "http://www.bcra.gob.ar/Pdfs/PublicacionesEstadisticas/BoletinEstadistico/" + 'boldat' + str(codigo) +".pdf"
    tables = read_pdf(path, pages=pages, area=[box], stream=True)

    tablas1 = pd.DataFrame()
    tablas2 = pd.DataFrame()

    for page in pages:
      if page % 2 != 0:
          tablas1 = pd.concat(tables)
          tablas1 = tablas1.drop(columns=['(6)(7)','Unnamed: 1','Unnamed: 2','Unnamed: 4'])
          tablas1 = tablas1.rename(columns={'Unnamed: 0':'Actividad','Unnamed: 3':'Capital Federal','Aires':'Gran Buenos Aires','Unnamed: 5':'Resto Bs As',
                                          'Unnamed: 6':'Catamarca','Unnamed: 7':'Cordoba','Unnamed: 8':'Corrientes','Unnamed: 9':'Chaco','Unnamed: 10':'Chubut',
                                          'Unnamed: 11':'Entre Rios','Unnamed: 12':'Formosa','Unnamed: 13':'Jujuy'})
          tablas1['ID'] = range(len(tablas1))
          return tablas1

      else:
        for tabla in tables:
          if len(tabla.columns) <= 17:
            tabla = tabla.drop(columns=['(6)(7)','Unnamed: 1'])
            tabla = tabla.rename(columns={'Unnamed: 0':'Actividad','Unnamed: 2':'La Pampa','Unnamed: 3':'La Rioja','Unnamed: 4':'Mendoza',
                                            'Unnamed: 5':'Misiones','Unnamed: 6':'Neuquen','Unnamed: 7':'Rio Negro','Unnamed: 8':'Salta',
                                            'Unnamed: 9':'San Juan','Unnamed: 10':'San Luis','Unnamed: 11':'Santa Cruz','Unnamed: 12':'Santa Fe',
                                            'Estero':'Santiago del Estero','Fuego':'Tierra del Fuego','Unnamed: 13':'Tucuman'})
            tabla = pd.DataFrame(tabla)
            tablas2 = tablas2.append(tabla)

          elif len(tabla.columns) > 17:
            tabla = tabla.drop(columns=['(6)(7)','Unnamed: 1','Unnamed: 13'])
            tabla = tabla.rename(columns={'Unnamed: 0':'Actividad','Unnamed: 2':'La Pampa','Unnamed: 3':'La Rioja','Unnamed: 4':'Mendoza',
                                          'Unnamed: 5':'Misiones','Unnamed: 6':'Neuquen','Unnamed: 7':'Rio Negro','Unnamed: 8':'Salta',
                                          'Unnamed: 9':'San Juan','Unnamed: 10':'San Luis','Unnamed: 11':'Santa Cruz','Unnamed: 12':'Santa Fe',
                                          'Estero':'Santiago del Estero','Fuego':'Tierra del Fuego','Unnamed: 14':'Tucuman'})
            tabla = pd.DataFrame(tabla)
            tablas2 = tablas2.append(tabla)
          
            tablas2 = tablas2.drop(columns=['Actividad'])
            tablas2['ID'] = range(len(tablas2))
            return tablas2

Then, if I run code like this, it works fine:

enero2020 = provincias(202001, pages=[307,309]).merge(provincias(202001, pages=[308,310]), how='left', on='ID').drop(columns=['ID'])
febrero2020 = provincias(202002, pages=[307,309]).merge(provincias(202002, pages=[308,310]), how='left', on='ID').drop(columns=['ID'])

But I have a problem with the June data. If I run this:

junio2021 = provincias(202106, pages=[313,315]).merge(provincias(202106, pages=[314,316]), how='left', on='ID').drop(columns=['ID'])

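it gives me the error in the title: Can only merge Series or DataFrame objects, a <class 'NoneType'> was passed. I can see that the problem is in the merge step, but I have tried many things and could not solve it.

I noticed that in the junio2021 case both pages [314,316] have the same number of columns, but I don't know whether that is the cause.

Thanks!!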

1 Answer

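It seems the June 2021 data doesn't need the if len(tabla.columns) <= 17: check, because this time tabula picked up all the columns. In your function the return tablas2 sits inside the elif len(tabla.columns) > 17: branch, so when no table has more than 17 columns the function ends without returning anything. It therefore returns None, and that None is what merge rejects.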
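To make the failure mode concrete, here is a minimal sketch that uses toy DataFrames instead of the BCRA PDFs. The helper name pick_wide and the sample data are hypothetical; the only point is that a return placed inside a branch leaves the function returning None when that branch never runs, and merge then raises the error from the title.

```python
import pandas as pd

def pick_wide(tables):
    # Hypothetical stand-in for the even-page branch of provincias():
    # the return is only reached when some table has more than 17 columns.
    for tabla in tables:
        if len(tabla.columns) > 17:
            return tabla
    # No explicit return here, so Python returns None.

left = pd.DataFrame({"ID": [0, 1], "Catamarca": [1, 2]})
narrow = pd.DataFrame({"ID": [0, 1], "La Pampa": [10, 20]})

result = pick_wide([narrow])  # no table is wide enough, so this falls through
print(result)

error_message = None
try:
    left.merge(result, how="left", on="ID")
except TypeError as exc:
    # pandas refuses to merge with something that is not a DataFrame/Series.
    error_message = str(exc)
    print(error_message)
```

Moving the return out of the branch, so it runs after the loop, avoids the None.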

In that case there is also no need to create the 'ID' column, thanks to the left_index=True, right_index=True parameters of pandas merge.
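As a small illustration with toy data (the frames and values below are made up, not taken from the PDF), merging on an explicit row counter and merging on the index give the same table when both frames carry a plain 0..n-1 index:

```python
import pandas as pd

# Toy stand-ins for the odd-page and even-page halves.
odd_pages = pd.DataFrame({"Actividad": ["A", "B"], "Catamarca": [1, 2]})
even_pages = pd.DataFrame({"La Pampa": [10, 20], "Tucuman": [30, 40]})

# Approach from the question: add an explicit row counter and merge on it.
via_id = (
    odd_pages.assign(ID=range(len(odd_pages)))
    .merge(even_pages.assign(ID=range(len(even_pages))), how="left", on="ID")
    .drop(columns=["ID"])
)

# Index-based approach: both frames already have a 0..n-1 index
# (for example after reset_index(drop=True)), so merge on it directly.
via_index = odd_pages.merge(even_pages, left_index=True, right_index=True)

print(via_index)
```

Note that merge defaults to how='inner', which matches the question's how='left' only when both halves have the same number of rows.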

The revised function looks like this:

import pandas as pd
from tabula.io import read_pdf

def provincias(codigo,pages:list):
  # Area to read on each page as [top, left, bottom, right]; fc is a scale
  # factor (presumably centimetres to PDF points, the unit tabula expects).
  box = [4,2.8,19,27]
  fc = 28.28
  for i in range(0, len(box)):
    box[i] *= fc

  path = "http://www.bcra.gob.ar/Pdfs/PublicacionesEstadisticas/BoletinEstadistico/" + 'boldat' + str(codigo) +".pdf"
  tables = read_pdf(path, pages=pages, area=[box], stream=True)

  # Stack the tables from all requested pages and renumber the rows 0..n-1,
  # so the two halves can later be merged on the index.
  df = pd.concat(tables).reset_index(drop=True)
  
  if pages[0] % 2 != 0:
    # Odd pages: the 'Actividad' labels plus the first group of provinces.
    df = df.drop(columns=['(6)(7)','Unnamed: 1','Unnamed: 2','Unnamed: 4'])
    df = df.rename(columns={'Unnamed: 0':'Actividad','Unnamed: 3':'Capital Federal','Aires':'Gran Buenos Aires','Unnamed: 5':'Resto Bs As',
                            'Unnamed: 6':'Catamarca','Unnamed: 7':'Cordoba','Unnamed: 8':'Corrientes','Unnamed: 9':'Chaco','Unnamed: 10':'Chubut',
                            'Unnamed: 11':'Entre Rios','Unnamed: 12':'Formosa','Unnamed: 13':'Jujuy'})
  else:
    # Even pages: the remaining provinces; the label column 'Unnamed: 0' is dropped.
    df = df.drop(columns=['(6)(7)','Unnamed: 0','Unnamed: 1'])
    df = df.rename(columns={'Unnamed: 2':'La Pampa','Unnamed: 3':'La Rioja','Unnamed: 4':'Mendoza',
                            'Unnamed: 5':'Misiones','Unnamed: 6':'Neuquen','Unnamed: 7':'Rio Negro','Unnamed: 8':'Salta',
                            'Unnamed: 9':'San Juan','Unnamed: 10':'San Luis','Unnamed: 11':'Santa Cruz','Unnamed: 12':'Santa Fe',
                            'Estero':'Santiago del Estero','Fuego':'Tierra del Fuego','Unnamed: 13':'Tucuman'})
  return df

I tried a few other months, but tabula never parses them correctly, so for those you would still need to use your previous code.
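If you do fall back to the earlier function, keep in mind that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the tablas2.append(tabla) calls fail on current pandas. Below is a minimal sketch of the same accumulation with pd.concat. The helper name combinar_pares and the toy frames are hypothetical stand-ins for the tabula output, and the return is placed after the loop so the function cannot fall through to None.

```python
import pandas as pd

def combinar_pares(tables):
    """Hypothetical helper: stack the even-page tables into one DataFrame."""
    piezas = []
    for tabla in tables:
        # Per-table clean-up (drop/rename) would go here, as in the question.
        piezas.append(tabla)  # list.append, not DataFrame.append
    # Return after the loop so any non-empty list of tables yields a DataFrame.
    return pd.concat(piezas, ignore_index=True)

# Toy stand-ins for two pages returned by tabula.
pagina_a = pd.DataFrame({"La Pampa": [1, 2], "Tucuman": [3, 4]})
pagina_b = pd.DataFrame({"La Pampa": [5], "Tucuman": [6]})

tablas2 = combinar_pares([pagina_a, pagina_b])
print(tablas2)
```

With ignore_index=True the result already has a 0..n-1 index, so it can be merged on the index without a separate 'ID' column.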

June 2021:

junio2021 = provincias(202106, pages=[313,315]).merge(provincias(202106, pages=[314,316]), left_index=True, right_index=True)
Answered 2021-08-23T20:08:01.910