pdf - Extracting text data from PDF files

Question

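Is it possible to parse text data from PDF files in R? There don't appear to be any relevant packages for this kind of extraction, but has anyone attempted it or seen it done in R?

In Python there is PDFMiner, but I would like to keep this analysis entirely in R if possible.

Any suggestions?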

score 29 · Accepted Answer

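I have had considerable success with pdftotext on Linux systems. By default, it creates foo.txt from a given foo.pdf.

That said, the text mining packages may have converters. A quick rseek.org search seems to agree with what your own search found.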
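For anyone who wants to drive pdftotext from inside R, a minimal sketch might look like the following. It assumes the pdftotext binary (from Xpdf or Poppler) is installed and on the PATH; the helper name pdf_to_text_file and the file name foo.pdf are made up for illustration.

```r
# Hypothetical helper: convert one PDF to a text file with the external
# pdftotext tool and return the text as a character vector of lines.
pdf_to_text_file <- function(pdf_path, layout = TRUE) {
  exe <- Sys.which("pdftotext")
  if (!nzchar(exe)) stop("pdftotext was not found on the PATH")
  # Write the output next to the input, with a .txt extension
  txt_path <- paste0(tools::file_path_sans_ext(pdf_path), ".txt")
  # -layout asks pdftotext to keep the physical layout of the page
  args <- c(if (layout) "-layout", shQuote(pdf_path), shQuote(txt_path))
  status <- system2(exe, args)
  if (status != 0) stop("pdftotext failed with exit status ", status)
  readLines(txt_path, warn = FALSE)
}

# Usage, assuming foo.pdf exists in the working directory:
# lines <- pdf_to_text_file("foo.pdf")
# head(lines)
```

Using system2() with quoted arguments, rather than pasting one long command string, keeps paths that contain spaces intact.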

score 28 · Accepted Answer

This is a very old thread, but for future reference: the pdftools R package extracts text from PDFs.

Answered 2016-07-06T08:08:13.663

score 9 · Accepted Answer

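A colleague got me started with this handy open-source tool: http://tabula.nerdpower.org/. Install it, upload the PDF, and select the table in the PDF that you need turned into data. It is not a direct solution in R, but it certainly beats manual labor.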

score 9 · Accepted Answer

A pure R solution could be:

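library('tm')
file <- 'namefile.pdf'
# Build a PDF reader; the "-layout" text option is handed to the extraction engine
Rpdf <- readPDF(control = list(text = "-layout"))
# Read the PDF into a corpus containing a single document
corpus <- VCorpus(URISource(file), 
      readerControl = list(reader = Rpdf))
# Pull the text out of the first (and only) document
corpus.array <- content(content(corpus)[[1]])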

You will then have the lines of the PDF in an array.

score 6 · Accepted Answer

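install.packages("pdftools")
library(pdftools)

# Download the PDF in binary mode so the file is not corrupted on Windows
download.file("http://www.nfl.com/liveupdate/gamecenter/56901/DEN_Gamebook.pdf", 
              "56901.DEN.Gamebook", mode = "wb")

# pdf_text() returns a character vector with one element per page
txt <- pdf_text("56901.DEN.Gamebook")
cat(txt[1])  # print the first page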
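pdf_text() returns one string per page, so a common next step is to split each page into lines. A minimal sketch in base R, where page_text is a made-up stand-in for one element of txt (so the snippet runs without downloading anything):

```r
# Made-up stand-in for a single page as returned by pdf_text(): one string
# with embedded newlines.
page_text <- "Title\n  alpha  1\n  beta   2\n"

# Split on newlines, trim surrounding whitespace, drop empty lines.
page_lines <- trimws(strsplit(page_text, "\n", fixed = TRUE)[[1]])
page_lines <- page_lines[nzchar(page_lines)]
print(page_lines)

# For a real document, the same idea applied to every page:
# all_lines <- lapply(txt, function(p) trimws(strsplit(p, "\n", fixed = TRUE)[[1]]))
```

From there the lines can be filtered or parsed with regular expressions, depending on how the PDF is laid out.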

score 5 · Accepted Answer

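The Tabula PDF table extractor app is based on a command-line application built on the Java JAR package tabula-extractor.

The R tabulizer package provides an R wrapper that makes it easy to pass in the path to a PDF file and extract data from its data tables.

Tabula does a good job of guessing where the tables are, but you can also tell it which part of a page to look at by specifying a target area of the page.

Data can be extracted from multiple pages, and a different area can be specified for each page if required.

For an example use case, see: When Documents Become Databases – Tabulizer R Wrapper for Tabula PDF Table Extractor.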
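As a rough sketch of the area-based extraction described above, assuming the tabulizer package (and the Java runtime it depends on) is installed. The file name report.pdf and the coordinates are placeholders, not values from the original answer; as far as I know, extract_tables() takes each area as c(top, left, bottom, right) in points.

```r
pdf_path <- "report.pdf"  # placeholder path, not from the original answer

# One target area per page, each as c(top, left, bottom, right) in points.
# These numbers are illustrative only.
pages <- c(1, 2)
areas <- list(
  c(100, 50, 400, 550),  # region on page 1
  c(80, 50, 300, 550)    # region on page 2
)

if (requireNamespace("tabulizer", quietly = TRUE) && file.exists(pdf_path)) {
  # guess = FALSE so that the supplied areas are used rather than
  # Tabula's automatic table detection
  tables <- tabulizer::extract_tables(pdf_path, pages = pages,
                                      area = areas, guess = FALSE)
  str(tables)
} else {
  message("tabulizer or ", pdf_path, " not available; skipping extraction")
}
```

Leaving out area and keeping the default guess = TRUE lets Tabula look for the tables itself, which is usually worth trying first.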

score 2 · Accepted Answer

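I used an external utility to do the conversion and called it from R. All the files had a leading table containing the required information.

Set the path to pdftotext.exe and convert the PDFs to text: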

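library(stringr)  # for str_sub()

# pdfFracList (PDF file names) and reportDir (their folder) are assumed to be defined already
exeFile <- "C:/Projects/xpdfbin-win-3.04/bin64/pdftotext.exe"

for (i in seq_along(pdfFracList)) {
    # Strip the ".pdf" extension to get the base file name
    fileNumber <- str_sub(pdfFracList[i], start = 1, end = -5)
    pdfSource <- paste0(reportDir, "/", fileNumber, ".pdf")
    txtDestination <- paste0(reportDir, "/", fileNumber, ".txt")
    print(paste0("File number ", i, ", Processing file ", pdfSource))
    # -table keeps the tabular layout; shQuote() protects paths that contain spaces
    system(paste(exeFile, "-table", shQuote(pdfSource), shQuote(txtDestination), sep = " "), wait = TRUE)
}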
