pdf - 如何通过 R 将目录中的所有 pdf 转换为 txt 格式？

Question

我正在尝试将位于我的计算机目录中的 PDF 文件列表转换为 txt 格式，以便 R 可以读取它并开始文本挖掘。你知道这段代码有什么问题吗？

library(tm) #load text mining library
setwd('D:/Directory') #sets R's working directory to near where my files are
ae.corpus<-Corpus(DirSource("D:/Directory/NewsArticles"),readerControl=list(reader=readPlain))
exe <- "C:\\Program Files\\xpdfbin-win-3.03\\bin32\\pdftotext.exe"
system(paste("\"", exe, "\" \"", ae.corpus, "\"", sep = ""), wait = F)
filetxt <- sub(".pdf", ".txt", dest)
shell.exec(filetxt); shell.exec(filetxt)    # strangely the first try always throws an error..

summary(ae.corpus) #check what went in
ae.corpus <- tm_map(ae.corpus, tolower)
ae.corpus <- tm_map(ae.corpus, removePunctuation)
ae.corpus <- tm_map(ae.corpus, removeNumbers)
myStopwords <- c(stopwords('english'), "available", "via")
ae.corpus <- tm_map(ae.corpus, removeWords, myStopwords) # this stopword file is at C:\Users\[username]\Documents\R\win-library\2.13\tm\stopwords 


ae.tdm <- DocumentTermMatrix(ae.corpus, control = list(minWordLength = 3))
inspect(ae.tdm)
findFreqTerms(ae.tdm, lowfreq=2)
findAssocs(ae.tdm, "economic",.7)
d<- Dictionary (c("economic", "uncertainty", "policy"))
inspect(DocumentTermMatrix(ae.corpus, list(dictionary = d)))

score 1 · Accepted Answer

尝试使用它：

dest <- ""           #same as setwd()
myfiles <- list.files(path = dest, pattern = "pdf",  full.names = TRUE)
# convert each PDF file that is named in the vector into a text file
# text file is created in the same directory as the PDFs
lapply(myfiles, function(i) system(paste('""',    #the path to Program files where the pdftotext.exe is saved
                                     paste0('"', i, '"')), wait = FALSE) )

接着

#combine files
files <- list.files(pattern = "[.]txt$")
outFile <- file("output.txt", "w") 
for (i in files){ 
x <- readLines(i) 
writeLines(x[2:(length(x)-1)], outFile) 
} 
close(outFile) 

#read data
txt<-read.table('output.txt',sep='\t', quote = "")

这有什么帮助！

pdf - 如何通过 R 将目录中的所有 pdf 转换为 txt 格式？

1 回答 1

Related

Reference