r - Using readPDF in R (tm package)

Question

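I'm a beginner with R and have been having some trouble using the tm package. I need to extract specific data from pages 55 to 300 of a PDF file and thought R might be a good way of doing this. (If anybody has a better idea, please let me know!) I did some searching, and after installing the tm package and the xpdf package, I tried reading this and tried zx8754's solution, but without success. I suspect it has something to do with the readPDF command. I get the following: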

Error in readPDF(PdftotextOptions = "-layout") : unused argument (PdftotextOptions = "-layout")

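I thought this had to do with trying to use the tm package and the xpdf package together, so I read Tony Breyal's solution (I can't post more than 2 links), set pdfinfo and pdftotext as environment variables (I'm on Win 8), and restarted. I'm sure I'm missing something. Right now I have pdftotext.exe in R's working directory. Can anyone help me configure this properly, so that the tm package calls the xpdf files correctly and the readPDF function works as it should?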

Again, I'm very new to this, so apologies if I'm way off. All help would be greatly appreciated.

Thanks in advance,

Justin

score 2 · Accepted Answer

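To help you get started, here is an example of a complete readPDF command for reading a PDF file. readPDF threw an error when I tried to retrieve the PDF file directly from the link you provided, so I first downloaded the PDF file to my working directory.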

library(tm)

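# File name (the PDF was downloaded to the working directory first)
filename <- "ea0607.pdf"

# readPDF() returns a reader function; the second set of parentheses calls
# that reader on the file. "-layout" is passed through to pdftotext.
doc <- readPDF(control = list(text = "-layout"))(elem = list(uri = filename),
                                                 language = "en",
                                                 id = "id1")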

The code above converts the PDF file to text and stores the result in doc. doc is actually a list, as the following shows:

str(doc)

List of 2
 $ content: chr [1:23551] "  STATE UNIVERSITY SYSTEM OF FLORIDA" "" "EXPENDITURE ANALYSIS" "      2006-2007" ...
 $ meta   :List of 7
  ..$ author       : chr "greg.jacques"
  ..$ datetimestamp: POSIXlt[1:1], format: "2007-12-10 11:33:48"
  ..$ description  : NULL
  ..$ heading      : chr " PGM=EASUSI-V01                                        STATE UNIVERSITY SYSTEM                                                 "| __truncated__
  ..$ id           : chr "ea0607.pdf"
  ..$ language     : chr "en"
  ..$ origin       : chr "Acrobat PDFMaker 8.1 for Word"
  ..- attr(*, "class")= chr "TextDocumentMeta"
 - attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"

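The text of the PDF file is stored in doc$content, while doc$meta contains various metadata about the PDF file. Each element of doc$content is one line of the PDF file. Here are lines 300 to 310 of the PDF file: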

doc$content[300:310]

 [1] ""                                                                                                                      
 [2] "and General (E&G) budget entity. The Expenditure Analysis continues to reflect special units separately and the"       
 [3] ""                                                                                                                      
 [4] "traditional program components and related activities have been further defined to support the funding formula. The"   
 [5] ""                                                                                                                      
 [6] "Expenditure Analysis format was revised in 1995-96 to include all activities in the funding formula as well as college"
 [7] ""                                                                                                                      
 [8] "detail by activity for the UF Health Science Center, the USF Health Science Center and the FSU Medical School. A"      
 [9] ""                                                                                                                      
[10] "definition of each follows:"                                                                                           
[11] ""
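
Note that the indices above are lines of extracted text, not PDF pages. If page numbers matter (the question asks about pages 55 to 300), one possible approach is to split the lines on the form feed character that pdftotext writes at the end of each page. The sketch below assumes those form feeds survive in doc$content, which is worth verifying on the real file. split_pages is a hypothetical helper, and sample_lines is a stand-in for doc$content.

```r
# Hypothetical helper: group lines of pdftotext output into pages.
# Assumption: every page after the first begins with a form feed ("\f"),
# which pdftotext emits at page ends unless -nopgbrk is given.
split_pages <- function(lines) {
  # A lone trailing form feed marks the end of the last page, not a new one
  n <- length(lines)
  if (n > 0 && lines[n] == "\f") lines <- lines[-n]

  starts_page <- startsWith(lines, "\f")
  page_no <- cumsum(starts_page) + 1L

  # Remove the form feed markers themselves
  lines <- sub("^\f", "", lines)
  unname(split(lines, page_no))
}

# Stand-in for doc$content: three short pages
sample_lines <- c("page one, line a", "page one, line b",
                  "\fpage two, line a",
                  "\fpage three, line a", "page three, line b")

pages <- split_pages(sample_lines)
length(pages)   # number of pages found
pages[[2]]      # lines of the second page

# With a real document, something like this would select a page range:
# wanted <- unlist(split_pages(doc$content)[55:300])
```

Keeping pages as a list means a range of pages can be selected by ordinary indexing before any further text processing.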

Hopefully this will help get you started.
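
On the configuration part of the question, it may help to confirm that R can locate the xpdf command-line tools. pdftotext can also be asked for a page range directly through its -f and -l options. The following is a minimal sketch and not part of the original answer: pdftotext_args and the output file name are hypothetical, and the conversion only runs if pdftotext and the input file are actually found.

```r
# Where (if anywhere) does R find the xpdf tools on the PATH?
# An empty string means the tool was not found.
tools <- Sys.which(c("pdfinfo", "pdftotext"))
print(tools)

# Hypothetical helper: build pdftotext arguments for a page range.
# -layout keeps the physical layout; -f / -l give the first and last page.
pdftotext_args <- function(pdf, out, first, last) {
  c("-layout", "-f", as.character(first), "-l", as.character(last),
    shQuote(pdf), shQuote(out))
}

pdf_file <- "ea0607.pdf"          # file name used in the answer
txt_file <- "ea0607_p55-300.txt"  # hypothetical output name
args <- pdftotext_args(pdf_file, txt_file, first = 55, last = 300)

# Only attempt the conversion when the tool and the input both exist
if (nzchar(tools[["pdftotext"]]) && file.exists(pdf_file)) {
  system2("pdftotext", args)
  page_text <- readLines(txt_file, warn = FALSE)
}
```

If Sys.which returns an empty string for pdftotext, the PATH setting is the first thing to revisit before looking at tm itself.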
