excel - PDF 数据和表格抓取到 Excel

Question

我正在尝试找出一种提高数据输入工作效率的好方法。

我想做的是想出一种从 PDF 中抓取数据并将其输入 Excel 的方法。

更具体地说，我正在使用的数据来自杂货店传单。就目前而言，我们必须手动将传单中的每笔交易输入数据库。传单样本是http://weeklyspecials.safeway.com/customer_Frame.jsp?drpStoreID=1551

我希望做的是为产品、价格和预定义选项（忠诚卡、优惠券、选择品种......之类的东西）提供列。

任何帮助将不胜感激，如果我需要更具体，请告诉我。

score 27 · Accepted Answer

在查看了 OP 链接到的特定 PDF 之后，我不得不说这并没有完全显示典型的表格格式。

它在“单元格”内包含许多图像，但单元格并非都严格垂直或水平对齐：

OP中链接的PDF的第6页

所以这甚至不是一张“漂亮”的桌子，而是一张极其丑陋和尴尬的桌子……

说了这么多，我还要补充：

一般来说，从 PDF 中提取“漂亮”的表格非常困难......

标准 PDF 不提供有关它们在页面上绘制的语义的任何提示：语法提供的唯一区别是矢量元素（线条、填充...）、图像和文本之间的区别。

通过解析 PDF 源代码以编程方式识别任何字符是否是表格的一部分或行的一部分，或者只是在其他空白区域内的孤独的单个字符都不容易。

有关为什么PDF 文件格式不应该被认为适合托管可提取的结构化数据的背景，请参阅这篇文章：

为什么为文档更新美元如此困难（ProPublica-Website）

...但是使用 TabulaPDF 这样做效果很好！

说了上面的话，现在让我补充一下：

对于一个令人惊叹的开源工具系列，它每周都在从 PDF 中提取表格数据（除非它们是扫描页面）变得越来越好——这与我在介绍性段落中所说的相矛盾！- 查看TabulaPDF。请参阅以下链接：

Tabula-Extractor 是用 Ruby 编写的。在后台，它使用 PDFBox（用 Java 编写）和其他一些第三方库。要运行，Tabula-Extractor 需要安装 JRuby-1.7。

安装 Tabula-Extractor

我直接从其 GitHub 源代码存储库中使用 Tabula-Extractor 的“前沿”版本。让它工作非常容易，因为在我的系统上 JRuby-1.7.4_0 已经存在：

mkdir ~/svn-stuff
cd ~/svn-stuff
git clone https://github.com/tabulapdf/tabula-extractor.git git.tabula-extractor

这个 Git 克隆中已经包含了所需的库，因此无需安装 PDFBox。命令行工具位于/bin/子目录中。

探索命令行选项：

~/svn-stuff/git.tabula-extractor/bin/tabula -h

Tabula helps you extract tables from PDFs

Usage:
       tabula [options] <pdf_file>
where [options] are:
         --pages, -p <s>:   Comma separated list of ranges, or all. Examples:
                            --pages 1-3,5-7, --pages 3 or --pages all. Default
                            is --pages 1 (default: 1)
          --area, -a <s>:   Portion of the page to analyze
                            (top,left,bottom,right). Example: --area
                            269.875,12.75,790.5,561. Default is entire page
       --columns, -c <s>:   X coordinates of column boundaries. Example
                            --columns 10.1,20.2,30.3
      --password, -s <s>:   Password to decrypt document. Default is empty
                            (default: )
             --guess, -g:   Guess the portion of the page to analyze per page.
             --debug, -d:   Print detected table areas instead of processing.
        --format, -f <s>:   Output format (CSV,TSV,HTML,JSON) (default: CSV)
       --outfile, -o <s>:   Write output to <file> instead of STDOUT (default:
                            -)
       --spreadsheet, -r:   Force PDF to be extracted using spreadsheet-style
                            extraction (if there are ruling lines separating
                            each cell, as in a PDF of an Excel spreadsheet)
    --no-spreadsheet, -n:   Force PDF not to be extracted using
                            spreadsheet-style extraction (if there are ruling
                            lines separating each cell, as in a PDF of an Excel
                            spreadsheet)
            --silent, -i:   Suppress all stderr output.
  --use-line-returns, -u:   Use embedded line returns in cells. (Only in
                            spreadsheet mode.)
           --version, -v:   Print version and exit
              --help, -h:   Show this message

提取 OP 想要的表

我什至没有尝试从 OP 的怪物 PDF 中提取这个丑陋的表格。我会把它作为练习留给那些感觉足够冒险的读者......

相反，我将演示如何提取一个“漂亮”的表。我将从官方 PDF-1.7 规范中获取第 651-653 页，这里用屏幕截图表示：

官方 PDF-1.7 规范的第 651-653 页

我使用了这个命令：

 ~/svn-stuff/git.tabula-extractor/bin/tabula \
   -p 651,652,653 -g -n -u -f CSV            \
    ~/Downloads/pdfs/PDF32000_2008.pdf

将生成的 CSV 导入 LibreOffice Calc 后，电子表格如下所示：

导入 CSV 后来自 LibreOffice 的屏幕截图

对我来说，这看起来像是对分布在 3 个不同 PDF 页面上的表格的完美提取。（甚至表格单元格中使用的换行符也进入了电子表格。）

更新

这是一个 ASCIinema 截屏视频（您也可以借助命令行工具在 Linux/MacOSX/Unix 终端本地下载asciinema和重播），主演tabula-extractor：

excel - PDF 数据和表格抓取到 Excel

1 回答 1

一般来说，从 PDF 中提取“漂亮”的表格非常困难......

...但是使用 TabulaPDF 这样做效果很好！

安装 Tabula-Extractor

提取 OP 想要的表

更新

Related

Reference