We can use the pdftools package:
library(pdftools)
# you can use an url or a path
pdf_url <- "https://cran.r-project.org/web/packages/pdftools/pdftools.pdf"
# `pdf_text` returns a character vector with one element per page
list_output <- pdftools::pdf_text(pdf_url)
length(list_output) # 5 elements for a 5 page pdf
# let's print the 5th
cat(list_output[[5]])
# Index
# pdf_attachments (pdf_info), 2
# pdf_convert (pdf_render_page), 3
# pdf_fonts (pdf_info), 2
# pdf_info, 2, 3
# pdf_render_page, 2, 3
# pdf_text, 2
# pdf_text (pdf_info), 2
# pdf_toc (pdf_info), 2
# pdftools (pdf_info), 2
# poppler_config (pdf_render_page), 3
# render (pdf_render_page), 3
# suppressMessages, 2
# 5
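Each element is one page as a single string with embedded newlines (which is why cat prints it nicely). If you need line-level access, strsplit gives it; the page string below is a small stand-in for list_output[[5]], not the OP's code:

```r
# split one page into individual lines; works on any pdf_text output
page <- "Index\npdf_text, 2\npdf_toc (pdf_info), 2"  # stand-in for list_output[[5]]
lines <- strsplit(page, "\n")[[1]]
lines[1] # first line of the page
```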
To extract the abstract from an article, the OP chose to grab the content between Abstract and Introduction.
We will take a list of CRAN pdfs and extract the authors as the text between Author and Maintainer (I handpicked a few with compatible formats). To do so, we loop over our list of urls, extract the content, collapse all of each pdf's text into a single string, then use regex.
urls <- c(pdftools = "https://cran.r-project.org/web/packages/pdftools/pdftools.pdf",
Rcpp = "https://cran.r-project.org/web/packages/Rcpp/Rcpp.pdf",
jpeg = "https://cran.r-project.org/web/packages/jpeg/jpeg.pdf")
lapply(urls, function(url) {
  list_output <- pdftools::pdf_text(url)
  # collapse all pages into one string and normalize whitespace
  text_output <- gsub('(\\s|\r|\n)+', ' ', paste(unlist(list_output), collapse = " "))
  # keep the text between "Author" and "Maintainer"
  trimws(regmatches(text_output, gregexpr("(?<=Author).*?(?=Maintainer)", text_output, perl = TRUE))[[1]][1])
})
# $pdftools
# [1] "Jeroen Ooms"
#
# $Rcpp
# [1] "Dirk Eddelbuettel, Romain Francois, JJ Allaire, Kevin Ushey, Qiang Kou, Nathan Russell, Douglas Bates and John Chambers"
#
# $jpeg
# [1] "Simon Urbanek <Simon.Urbanek@r-project.org>"
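The same approach covers the OP's original goal, the text between Abstract and Introduction. A minimal sketch: the helper name extract_between is my own, and article.pdf is a hypothetical path, not a real file:

```r
# extract the text between two markers in pdf_text output;
# extract_between and article.pdf are illustrative, not from the OP
extract_between <- function(pages, start, end) {
  # collapse pages and normalize whitespace, as in the loop above
  text <- gsub("(\\s|\r|\n)+", " ", paste(pages, collapse = " "))
  pattern <- sprintf("(?<=%s).*?(?=%s)", start, end)
  trimws(regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]][1])
}

# pages <- pdftools::pdf_text("article.pdf")  # hypothetical article
# extract_between(pages, "Abstract", "Introduction")
```

This only works when both markers appear exactly once and in order; papers that title the section differently (e.g. "1 Introduction" on a new page) may need a looser pattern.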