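I'm writing an R program that analyzes a large amount of unstructured text data and builds word-frequency matrices. I've been using the wfm and wfdf functions from the qdap package, but I've noticed they are a bit slow for my needs. Generating the word-frequency matrix appears to be the bottleneck.
The code for my function is below.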
    library(qdap)

    liwcr <- function(inputText, dict) {
        if (!file.exists(dict))
            stop("Dictionary file does not exist.")

        # Read in dictionary categories
        # Start by figuring out where the category list begins and ends
        dictionaryText <- readLines(dict)
        if (length(grep("%", dictionaryText)) != 2)
            stop("Dictionary is not properly formatted. Make sure category list is correctly partitioned (using '%').")

        catStart <- grep("%", dictionaryText)[1]
        catStop <- grep("%", dictionaryText)[2]
        dictLength <- length(dictionaryText)

        dictionaryCategories <- read.table(dict, header = FALSE, sep = "\t", skip = catStart, nrows = (catStop - 2))

        wordCount <- word_count(inputText)

        outputFrame <- dictionaryCategories
        outputFrame["count"] <- 0

        # Now read in dictionary words
        no_col <- max(count.fields(dict, sep = "\t"), na.rm = TRUE)
        dictionaryWords <- read.table(dict, header = FALSE, sep = "\t", skip = catStop, nrows = (dictLength - catStop), fill = TRUE, quote = "\"", col.names = 1:no_col)

        workingMatrix <- wfdf(inputText)
        for (i in workingMatrix[, 1]) {
            if (i %in% dictionaryWords[, 1]) {
                occurrences <- 0
                foundWord <- dictionaryWords[dictionaryWords$X1 == i, ]
                foundCategories <- foundWord[1, 2:no_col]
                for (w in foundCategories) {
                    if (!is.na(w) & (w != "")) {
                        existingCount <- outputFrame[outputFrame$V1 == w, ]$count
                        outputFrame[outputFrame$V1 == w, ]$count <- existingCount + workingMatrix[workingMatrix$Words == i, ]$all
                    }
                }
            }
        }
        return(outputFrame)
    }
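I realize the for loop is inefficient, so to locate the bottleneck I tested the code without that part (simply reading each text file and generating the word-frequency matrix) and saw almost no speed improvement. Example: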
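    library(qdap)

    # Create a scratch folder and fill it with n fake "tweet" files
    fn <- reports::folder(delete_me)
    n <- 10000

    invisible(lapply(1:n, function(i) {
        out <- paste(sample(key.syl[[1]], 30, TRUE), collapse = " ")
        cat(out, file = file.path(fn, sprintf("tweet%s.txt", i)))
    }))

    filename <- sprintf("tweet%s.txt", 1:n)

    # Read each file back and build its word-frequency matrix
    # (my real data lives under "/toshi/twitter_en/"; here the files are read from fn)
    for (i in seq_along(filename)) {
        print(filename[i])
        text <- readLines(file.path(fn, filename[i]))
        freq <- wfm(text)
    }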
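To show what I mean by locating the bottleneck, here is a minimal sketch that times the read step and the counting step separately on a single file. It uses only base R. `count_words` is a hypothetical helper I made up for illustration; it is not part of qdap, and its tokenization is much cruder than what `wfm` does, so it only gives a rough baseline to compare against.

```r
# Hypothetical helper (not part of qdap): crude base-R word counts.
count_words <- function(text) {
  # Lower-case, then split on anything that is not a letter or apostrophe
  tokens <- unlist(strsplit(tolower(text), "[^a-z']+"))
  # Drop empty strings left over from leading separators
  tokens <- tokens[nzchar(tokens)]
  table(tokens)
}

# Write one small sample file to a temporary location
sample_file <- tempfile(fileext = ".txt")
writeLines("the cat sat on the mat the end", sample_file)

# Time reading and counting separately to see which step dominates
read_time <- system.time(text <- readLines(sample_file))
count_time <- system.time(freq <- count_words(text))

print(read_time["elapsed"])
print(count_time["elapsed"])
print(freq)
```

Swapping `count_words(text)` for `wfm(text)` in the second timing call would show how much of the per-file time is spent inside qdap rather than in file I/O.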
The input files are Twitter and Facebook status posts.
Is there any way to speed up this code?
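EDIT2: Due to institutional restrictions, I can't post any of the raw data. Just to give an idea of what I'm dealing with, though: 25k text files, each containing all available tweets from a single Twitter user. There are another 100k files containing Facebook status updates, structured the same way.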