python - Error reading multiple PDF pages with tabula-py

Question

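I am trying to read a multi-page PDF file that contains a table in the same area of every page. The number of pages can vary from file to file.

I am trying the code below, but it does not work: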

import tabula
df = tabula.read_pdf("dados/nota.pdf", guess=False, stream=True, pages='all', encoding="utf-8", area=(238.00, 32.00, 400.00, 563.00))

It returns this error:

ParserError                               Traceback (most recent call last)
~\Anaconda3\envs\alura_pandas\lib\site-packages\tabula\wrapper.py in read_pdf(input_path, output_format, encoding, java_options, pandas_options, multiple_tables, **kwargs)
    171         try:
--> 172             return pd.read_csv(io.BytesIO(output), **pandas_options)
    173 

~\Anaconda3\envs\alura_pandas\lib\site-packages\pandas\io\parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    708 
--> 709         return _read(filepath_or_buffer, kwds)
    710 

~\Anaconda3\envs\alura_pandas\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
    454     try:
--> 455         data = parser.read(nrows)
    456     finally:

~\Anaconda3\envs\alura_pandas\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
   1068 
-> 1069         ret = self._engine.read(nrows)
   1070 

~\Anaconda3\envs\alura_pandas\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
   1838         try:
-> 1839             data = self._reader.read(nrows)
   1840         except StopIteration:

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()

pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()

ParserError: Error tokenizing data. C error: Expected 12 fields in line 13, saw 13


During handling of the above exception, another exception occurred:

CSVParseError                             Traceback (most recent call last)
<ipython-input-3-f2350ca5dd21> in <module>
----> 1 df = tabula.read_pdf("dados/nota.pdf", guess=False, stream=True, pages='all', encoding="utf-8", area=(238.00, 32.00, 400.00, 563.00))

~\Anaconda3\envs\alura_pandas\lib\site-packages\tabula\wrapper.py in read_pdf(input_path, output_format, encoding, java_options, pandas_options, multiple_tables, **kwargs)
    179             )
    180 
--> 181             raise CSVParseError(message, e)
    182 
    183 

CSVParseError: Error failed to create DataFrame with different column tables.
Try to set `multiple_tables=True`or set `names` option for `pandas_options`. 
, caused by ParserError('Error tokenizing data. C error: Expected 12 fields in line 13, saw 13\n',)

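In the read_pdf call, if I change pages='all' to pages=1, pages=2, and so on, it works. However, I need to read all pages, and the number of pages can change depending on the file.

Does anyone have any clue about this?

Edit: I managed to read the tables by adding the multiple_tables=True parameter. The code now looks like this: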

df = tabula.read_pdf("dados/nota.pdf", guess=False, stream=True, multiple_tables=True, pages='all', encoding="utf-8", area=(238.00, 32.00, 400.00, 563.00))

I get this result:

[     0           1    2             3      4                        5   \
 0     Q  Negociação  C/V  Tipo mercado  Prazo  Especificação do título   
 1   NaN   1-BOVESPA    C         VISTA    NaN              ITAUSAPN N1   
 2   NaN   1-BOVESPA    C         VISTA    NaN       LOCAMERICAON EB NM   
 3   NaN   1-BOVESPA    C         VISTA    NaN       LOCAMERICAON EB NM   
 4   NaN   1-BOVESPA    C         VISTA    NaN            PETRORIOON NM   
 5   NaN   1-BOVESPA    C         VISTA    NaN            PETRORIOON NM   
 6   NaN   1-BOVESPA    C         VISTA    NaN                 SCHULZPN   
 7   NaN   1-BOVESPA    C         VISTA    NaN                 SCHULZPN   
 8   NaN   1-BOVESPA    C         VISTA    NaN           VULCABRASON NM   
 9   NaN   1-BOVESPA    C         VISTA    NaN           VULCABRASON NM   
 10  NaN   1-BOVESPA    C         VISTA    NaN           VULCABRASON NM   
 11  NaN   1-BOVESPA    C         VISTA    NaN           VULCABRASON NM   

           6           7    8               9                        10   11  
 0   Obs. (*)  Quantidade  NaN  Preço / Ajuste  Valor Operação / Ajuste  D/C  
 1        NaN         NaN  800           13,84                11.072,00    D  
 2        NaN         NaN  300           17,01                 5.103,00    D  
 3        NaN         NaN  200           17,01                 3.402,00    D  
 4        NaN         NaN  500           18,01                 9.005,00    D  
 5        NaN         NaN  100           18,01                 1.801,00    D  
 6        NaN         NaN  500            8,79                 4.395,00    D  
 7        NaN         NaN  700            8,78                 6.146,00    D  
 8        NaN         NaN  300            7,87                 2.361,00    D  
 9        NaN         NaN  300            7,87                 2.361,00    D  
 10       NaN         NaN  300            7,87                 2.361,00    D  
 11       NaN         NaN  200            7,87                 1.574,00    D  ,
      0           1    2             3      4                        5   \
 0     Q  Negociação  C/V  Tipo mercado  Prazo  Especificação do título   
 1   NaN   1-BOVESPA    V         VISTA    NaN          LOCAMERICAON NM   
 2   NaN   1-BOVESPA    V         VISTA    NaN          LOCAMERICAON NM   
 3   NaN   1-BOVESPA    V         VISTA    NaN          LOCAMERICAON NM   
 4   NaN   1-BOVESPA    V         VISTA    NaN          LOCAMERICAON NM   
 5   NaN   1-BOVESPA    V         VISTA    NaN          LOCAMERICAON NM   
 6   NaN   1-BOVESPA    V         VISTA    NaN            PETRORIOON NM   
 7   NaN   1-BOVESPA    C         VISTA    NaN           VULCABRASON NM   
 8   NaN   1-BOVESPA    V         VISTA    NaN           VULCABRASON NM   
 9   NaN   1-BOVESPA    V         VISTA    NaN           VULCABRASON NM   
 10  NaN   1-BOVESPA    V         VISTA    NaN           VULCABRASON NM   

           6           7   8               9                        10  11   12  
 0   Obs. (*)  Quantidade NaN  Preço / Ajuste  Valor Operação / Ajuste NaN  D/C  
 1        NaN         100 NaN           17,20                 1.720,00 NaN    C  
 2        NaN         100 NaN           17,20                 1.720,00 NaN    C  
 3        NaN         100 NaN           17,20                 1.720,00 NaN    C  
 4        NaN         100 NaN           17,20                 1.720,00 NaN    C  
 5        NaN         100 NaN           17,20                 1.720,00 NaN    C  
 6        NaN         600 NaN           18,60                11.160,00 NaN    C  
 7          D       1.100 NaN            7,75                 8.525,00 NaN    D  
 8          D         100 NaN            7,86                   786,00 NaN    C  
 9          D         100 NaN            7,86                   786,00 NaN    C  
 10         D         900 NaN            7,86                 7.074,00 NaN    C  ]

How do I turn this into a single DataFrame?

score 0 · Accepted Answer


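For historical reasons, the multiple_tables option returns its result as a list of DataFrames, one per extracted table, rather than a single DataFrame.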
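The traceback in the question shows that, without multiple_tables, the extracted output is handed to pd.read_csv as a single CSV, which fails once one page yields 12 fields per row and another yields 13. The sketch below reproduces that kind of failure with pandas alone; the CSV text is made up for illustration and is not taken from the PDF.

```python
import io

import pandas as pd

# Made-up CSV text standing in for what gets passed to pandas:
# the header and first data row have 2 fields, the last row has 3.
ragged_csv = "a,b\n1,2\n3,4,5\n"

try:
    pd.read_csv(io.StringIO(ragged_csv))
    error_message = None
except pd.errors.ParserError as exc:
    # Same family of error as in the question's traceback
    error_message = str(exc)

print(error_message)
```

Returning one DataFrame per table avoids this, because each table is parsed on its own and may have its own number of columns.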

Answered 2019-11-22T15:10:31.030

score 0 · Accepted Answer

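You can combine the list into a single DataFrame this way:

import pandas as pd
# df is the list of DataFrames returned by tabula.read_pdf(..., multiple_tables=True)
dataframe = pd.concat(df, ignore_index=True)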
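A list of DataFrames is usually combined with pd.concat. The sketch below uses two small made-up frames as stand-ins for the per-page tables (the names page1, page2 and tables are hypothetical), and gives them different widths on purpose, as in the question's output where one page has 12 columns and the other 13.

```python
import pandas as pd

# Hypothetical stand-ins for the list returned by
# tabula.read_pdf(..., multiple_tables=True); values are illustrative only.
page1 = pd.DataFrame([["C", "800", "13,84"], ["C", "300", "17,01"]])
page2 = pd.DataFrame([["V", "100", "17,20", "C"]])
tables = [page1, page2]

# Stack the tables vertically; columns are aligned by label (0, 1, 2, ...),
# so rows from the narrower table get NaN in the extra column.
combined = pd.concat(tables, ignore_index=True)

print(combined.shape)
print(combined)
```

Because alignment is by column label, tables whose columns were split differently on each page will likely still need their columns lined up (for example by dropping empty columns) before or after concatenating.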
