I am extracting text from PDFs. I want to remove punctuation and see which key words repeat and how often they occur.
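library(pdftools)
library(tm)
setwd("S:/Shared Folders/Impact Investing/Investment/Scripts/PDF")
# Collect every PDF in the working directory
files <- list.files(pattern = "pdf$")
opinions <- lapply(files, pdf_text)
# Build a corpus, reading each file with tm's PDF reader
corp <- Corpus(URISource(files),
               readerControl = list(reader = readPDF))
# Build the term-document matrix; bounds$global = c(3, Inf) keeps only
# terms that appear in at least 3 documents
opinions.tdm <- TermDocumentMatrix(corp,
                                   control =
                                     list(removePunctuation = TRUE,
                                          stopwords = TRUE,
                                          tolower = TRUE,
                                          stemming = TRUE,
                                          removeNumbers = TRUE,
                                          bounds = list(global = c(3, Inf))))
# Show the first 10 terms
inspect(opinions.tdm[1:10,])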
I am currently getting an error:
Error in `[.simple_triplet_matrix`(opinions.tdm, 1:10, ) : subscript out of bounds
My opinions.tdm has the following characteristics:
opinions.tdm: List of length 6. nrow: integer [1]. ncol: [1]. dimnames: list [2]. attributes: [3]
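
One thing that may be worth checking, as an assumption only because the post does not show the matrix dimensions: `bounds = list(global = c(3, Inf))` keeps only terms found in at least three documents, so the matrix could end up with fewer than 10 rows, and `opinions.tdm[1:10,]` would then fail with this error. The sketch below uses three made-up strings in place of the PDFs (the `docs` vector and the names `tdm`, `res`, and `n_show` are hypothetical) to show the dimension check and a bounded index.

```r
library(tm)

# Hypothetical stand-in documents; the real corpus comes from the PDFs.
docs <- c("impact investing fund report",
          "impact fund returns",
          "impact investing returns")
corp <- VCorpus(VectorSource(docs))

# Same global bound as in the post: keep terms found in >= 3 documents.
tdm <- TermDocumentMatrix(corp,
                          control = list(bounds = list(global = c(3, Inf))))

# Check the dimensions before indexing: rows are terms, columns are documents.
print(dim(tdm))

# Requesting more rows than exist fails; capture the message instead of stopping.
res <- tryCatch(tdm[1:10, ], error = function(e) conditionMessage(e))
print(res)

# Index only as many rows as are actually available.
n_show <- min(10, nrow(tdm))
inspect(tdm[seq_len(n_show), ])
```

If `dim(opinions.tdm)` on the real data reports fewer than 10 rows, loosening or dropping the `bounds` setting, or indexing with `seq_len(min(10, nrow(opinions.tdm)))`, might be enough to get past the error.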