r - Text mining PDF files / word frequency issues

Question

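I am trying to text-mine PDFs of articles that contain rich PDF encodings and graphics. With some PDF documents, the most frequent words I get are phi, taeoe, toe, sigma, gamma and so on. The same code works well for other PDFs, but for these I get seemingly random Greek letter names. Is this a character encoding problem? (All the documents are in English, by the way.) Any suggestions?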

# Here is the link to pdf file for testing
# www.sciencedirect.com/science/article/pii/S0164121212000532
library(tm)
uri <- c("2012.pdf")
if(all(file.exists(Sys.which(c("pdfinfo", "pdftotext"))))) {
 pdf <- readPDF(control = list(text = "-layout"))(elem = list(uri = uri),
                                              language = "en",
                                              id = "id1")
 content(pdf)[1:4]
 }


docs<- Corpus(URISource(uri, mode = ""),
    readerControl = list(reader = readPDF(engine = "ghostscript")))
summary(docs)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
# Wrap base functions in content_transformer() so documents stay tm objects
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, stopwords("english"))

library(SnowballC)
docs <- tm_map(docs, stemDocument)
docs <- tm_map(docs, stripWhitespace)

dtm <- DocumentTermMatrix(docs)   
tdm <- TermDocumentMatrix(docs) 
freq <- colSums(as.matrix(dtm))   
length(freq)  
ord <- order(freq)
dtms <- removeSparseTerms(dtm, 0.1)
freq[head(ord)] 
freq[tail(ord)]

score 0 · Accepted Answer

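I think ghostscript is what is causing all the trouble here. Assuming pdfinfo and pdftotext are properly installed, this code works without producing the strange words you mentioned: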

library(tm)
uri <- c("2012.pdf")
# Extract the text with pdftotext (xpdf engine) instead of ghostscript
pdf <- readPDF(engine = "xpdf",
               control = list(text = "-layout"))(elem = list(uri = uri),
                                               language = "en",
                                               id = "id1")
docs <- Corpus(VectorSource(content(pdf)))
docs <- tm_map(docs, removeNumbers)
# content_transformer() keeps the documents as tm objects after tolower
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removePunctuation)
library(SnowballC)
docs <- tm_map(docs, stemDocument)
docs <- tm_map(docs, stripWhitespace)
dtm <- DocumentTermMatrix(docs)
tdm <- TermDocumentMatrix(docs)
# Total count of each term across all documents
freq <- colSums(as.matrix(dtm))
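
Before plotting anything, it can help to look at the top terms directly and to check how much of the total count comes from Greek letter names, since that was the symptom in the question. The following is only a sketch: the `freq` vector here is made-up data standing in for `colSums(as.matrix(dtm))`, and the `greek` list and the `greek_share` measure are assumptions for illustration, not part of tm.

```r
# Hypothetical term frequencies, standing in for colSums(as.matrix(dtm)).
freq <- c(softwar = 42, test = 37, sigma = 30, model = 25, phi = 21, toe = 3)

# Most frequent terms first.
top_terms <- sort(freq, decreasing = TRUE)
head(top_terms, 5)

# Greek letter names (an illustrative list) that may appear when
# symbol fonts in a PDF are decoded badly.
greek <- c("alpha", "beta", "gamma", "delta", "epsilon", "theta", "lambda",
           "mu", "pi", "rho", "sigma", "tau", "phi", "chi", "psi", "omega")

# Terms that match the list, and their share of the total count.
suspect <- freq[names(freq) %in% greek]
greek_share <- sum(suspect) / sum(freq)
greek_share
```

A large share for one extraction engine and a small one for another would suggest the extraction step, rather than the cleaning steps, is where the odd terms come from.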

We can visualize the most frequent words in your PDF file with a word cloud:

library(wordcloud)
wordcloud(docs, max.words=80, random.order=FALSE, scale= c(3, 0.5), colors=brewer.pal(8,"Dark2"))

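Obviously this result is not perfect, mainly because stemming hardly ever gives 100% reliable results (for example, we still have "issue" and "issues" as separate words, or "method" and "methods"). I am not aware of any infallible stemming algorithm in R, although SnowballC does a reasonably good job.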
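
To see how the stemmer treats particular word pairs in isolation, one option is to call `wordStem()` from SnowballC directly. This is a minimal sketch; the word list is chosen only for illustration. If a pair stems to the same form here but still shows up as two terms in the cloud, the difference more likely comes from an earlier cleaning step than from the stemmer itself.

```r
library(SnowballC)

# Word pairs to check; chosen for illustration only.
words <- c("issue", "issues", "method", "methods")

# wordStem() applies the Snowball stemmer to each element of the vector.
stems <- wordStem(words, language = "english")

# Side-by-side view of each word and its stem.
data.frame(word = words, stem = stems)
```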
