r - Trying to extract a subset of pages from each PDF in a directory of 70 PDF files

Question

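I am using tidyverse, tidytext, and pdftools. I want to parse the words in a directory of 70 PDF files. I am doing this successfully with these tools, but the code below grabs all pages rather than the subset I want. I need to skip the first two pages and then select page 3 through the end of the file for each PDF.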

directory <- "Student_Artifacts/"
pdfs <- paste(directory, "/", list.files(directory, pattern = "*.pdf"), sep = "")
pdf_names <- list.files(directory, pattern = "*.pdf")
pdfs_text <- map(pdfs, (pdf_text))
my_data <- data_frame(document = pdf_names, text = pdfs_text)

I found that by putting [3:12] after the call like this, I get documents 3 through 12:

pdfs_text <- map(pdfs, (pdf_text))[3:12]

That is not what I want. How can I use the [3:12] notation to pull the pages I want from each PDF file?

score 3 · Accepted Answer

First, you can index pages 3 through 12 of each PDF inside the mapping over pdf_text, with only a very small change:

pdfs_text <- map(pdfs, ~ pdf_text(.x)[3:12])
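The difference between the two placements of the index can be seen without any PDFs. The following is a minimal sketch in base R, using lapply() as a stand-in for map() and a hypothetical fake_pdf_text() helper (not part of pdftools) that returns one string per page, as pdf_text() does:

```r
# Hypothetical stand-in for pdf_text(): returns one string per page.
fake_pdf_text <- function(n_pages) paste("page", seq_len(n_pages))

# Three pretend documents with 12, 15, and 20 pages.
page_counts <- c(12, 15, 20)

# Index OUTSIDE the mapping: subsets the list of documents.
docs_subset <- lapply(page_counts, fake_pdf_text)[2:3]
length(docs_subset)        # 2 documents, each with all of its pages

# Index INSIDE the mapping: subsets the pages of every document.
pages_subset <- lapply(page_counts, function(n) fake_pdf_text(n)[3:12])
length(pages_subset)       # still 3 documents
lengths(pages_subset)      # 10 pages kept from each
pages_subset[[1]][1]       # "page 3"
```

The same reasoning applies to map(pdfs, ~ pdf_text(.x)[3:12]): the index has to sit inside the function applied to each file.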

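Note that [3:12] assumes every PDF has at least 12 pages: shorter documents are padded with NA, and anything after page 12 is dropped. Reading the files can also be slow, especially if some of them are really large. Try something like this (I'm using R's PDF documentation for demonstration):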

library(furrr)
#> Loading required package: future
library(pdftools)
library(tidyverse)
library(magrittr)
#> 
#> Attaching package: 'magrittr'
#> The following object is masked from 'package:purrr':
#> 
#>     set_names
#> The following object is masked from 'package:tidyr':
#> 
#>     extract

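# multiprocess is deprecated in recent versions of future; multisession is the portable replacement
plan(multisession)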

directory <- file.path(R.home("doc"), "manual")
pdf_names <- list.files(directory, pattern = "\\.pdf$", full.names = TRUE)
# Drop the full reference manual since it's so big
pdf_names %<>% str_subset("fullrefman.pdf", negate = TRUE)
pdfs_text <- future_map(pdf_names, pdf_text, .progress = TRUE)
#> Progress: ----------------------------------------------------------------------------------- 100%

my_data   <- tibble(
  document = basename(pdf_names), 
  text     = map_chr(pdfs_text, ~ {
    str_c("Page ", seq_along(.x), ": ", str_squish(.x)) %>% 
      tail(-2) %>% 
      str_c(collapse = "; ")
  })
)

my_data
#> # A tibble: 6 x 2
#>   document    text                                                         
#>   <chr>       <chr>                                                        
#> 1 R-admin.pdf "Page 3: i Table of Contents 1 Obtaining R . . . . . . . . .~
#> 2 R-data.pdf  "Page 3: i Table of Contents Acknowledgements . . . . . . . ~
#> 3 R-exts.pdf  "Page 3: i Table of Contents Acknowledgements . . . . . . . ~
#> 4 R-intro.pdf "Page 3: i Table of Contents Preface . . . . . . . . . . . .~
#> 5 R-ints.pdf  "Page 3: i Table of Contents 1 R Internal Structures . . . .~
#> 6 R-lang.pdf  "Page 3: i Table of Contents 1 Introduction . . . . . . . . ~

Created on 2019-10-19 by the reprex package (v0.3.0)
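The construction of the text column can be tried on a made-up page vector. This is only a sketch: it swaps the stringr calls for base R equivalents (paste0(), gsub(), trimws()), and the squish() helper is a hypothetical approximation of str_squish():

```r
# Made-up pages with messy whitespace, standing in for pdf_text() output.
pages <- c("Cover", "Blank", "  Intro   text ", "More\n  text")

# Rough base R equivalent of str_squish(): trim ends, collapse inner whitespace.
squish <- function(x) gsub("\\s+", " ", trimws(x))

# Label every page first, so the numbering survives dropping the first two.
labelled <- paste0("Page ", seq_along(pages), ": ", squish(pages))
text <- paste(tail(labelled, -2), collapse = "; ")
text
# "Page 3: Intro text; Page 4: More text"
```

Labelling before dropping is the design choice that matters here: the surviving pages keep their original page numbers.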

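Key points:

tail(-2) is doing the work you care about most: dropping the first two pages. You would normally use tail() to grab the last n pages, but it is also great for grabbing everything except the first n pages. Just use a negative number.
plan() and future_map() read the PDFs in parallel, with each virtual core reading one PDF at a time. Also, a progress bar!
I did some fancy string concatenation when building text, since it looks like you ultimately want the full text of each document's pages in a single cell of the final table. I insert "; Page [n]: " between the text of each page so that no information is lost, and I also strip the extra whitespace throughout the text, since there is usually a lot of it.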
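As a small illustration of the first point, here is a sketch in base R comparing a fixed [3:12] index with tail(-2) on documents of different lengths. The page vectors are made up for the example; real input would come from pdf_text():

```r
# Made-up page vectors standing in for pdf_text() output.
short_doc <- paste("page", 1:5)    # 5 pages
long_doc  <- paste("page", 1:20)   # 20 pages
tiny_doc  <- paste("page", 1:2)    # only 2 pages

# A fixed index pads short documents with NA and truncates long ones.
fixed_short <- short_doc[3:12]
sum(is.na(fixed_short))            # 7 missing "pages"
length(long_doc[3:12])             # 10, pages 13-20 are lost

# tail(-2) drops the first two pages, whatever the length.
tail(short_doc, -2)                # "page 3" "page 4" "page 5"
length(tail(long_doc, -2))         # 18
length(tail(tiny_doc, -2))         # 0, nothing left after the first two pages

# Equivalent negative index. Avoid x[3:length(x)]: when length(x) is 2,
# 3:2 counts down and returns c(NA, "page 2").
identical(short_doc[-(1:2)], tail(short_doc, -2))
```

If some of the 70 files might have two pages or fewer, checking for zero-length results before collapsing the text is a reasonable safeguard.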
