If I understand correctly, an lapply solution is probably the way to answer your question. It's the same approach as the answer you linked to, but here's a self-contained example that may be closer to your use case:
Load the libraries and some reproducible data (please include these in your future questions):
library(tm)
library(RWeka)
data(crude)
Your bigram tokenizer...
# Tokenizer for n-grams, passed on to the term-document matrix constructor
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
txtTdmBi <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))
Check that it worked by inspecting a random sample...
inspect(txtTdmBi[1000:1005, 10:15])
A term-document matrix (6 terms, 6 documents)

Non-/sparse entries: 1/35
Sparsity           : 97%
Maximal term length: 18
Weighting          : term frequency (tf)

                      Docs
Terms                 248 273 349 352 353 368
  for their             0   0   0   0   0   0
  for west              0   0   0   0   0   0
  forced it             0   0   0   0   0   0
  forced to             0   0   0   0   0   0
  forces trying         1   0   0   0   0   0
  foreign investment    0   0   0   0   0   0
Here's the answer to your question:
Now use an lapply function to calculate the associated words for every item in the vector of terms in the term-document matrix. The vector of terms is most simply accessed with txtTdmBi$dimnames$Terms; for example, txtTdmBi$dimnames$Terms[[1005]] is "foreign investment".
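A small aside (my addition, not part of the original answer): if you want to go the other way and find the numeric index of a particular term, base R's which() does it:

# My addition: look up the position of a term in the Terms vector
which(txtTdmBi$dimnames$Terms == "foreign investment")  # should return 1005 for the crude data here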
Here I've used llply from the plyr package so we get a progress bar (a comfort for big jobs), but it's basically the same as the base lapply function.
library(plyr)
dat <- llply(txtTdmBi$dimnames$Terms, function(i) findAssocs(txtTdmBi, i, 0.5), .progress = "text" )
The output is a list where each item is a vector of named numbers: the names are the terms and the numbers are the correlation values. For example, to see the terms associated with "foreign investment", we can access the list like this:
dat[[1005]]
And here are the terms associated with that term (I've just pasted in the top few):
         168 million            1986 was          1987 early             300 mln              31 pct
                1.00                1.00                1.00                1.00                1.00
               a bit        a crossroads           a leading         a political        a population
                1.00                1.00                1.00                1.00                1.00
           a reduced            a series          a slightly          about zero  activity continues
                1.00                1.00                1.00                1.00                1.00
        advisers are agricultural sector     agriculture the            all such        also reviews
                1.00                1.00                1.00                1.00                1.00
        and advisers         and attract         and imports     and liberalised           and steel
                1.00                1.00                1.00                1.00                1.00
           and trade         and virtual     announced since          appears to         are equally
                1.00                1.00                1.00                1.00                1.00
    are recommending           areas for            areas of               as it            as steps
                1.00                1.00                1.00                1.00                1.00
           asia with        asian member  assesses indonesia         attract new          balance of
                1.00                1.00                1.00                1.00                1.00
Is that what you want to do?
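One optional convenience (my addition, not part of the original answer): because dat is in the same order as txtTdmBi$dimnames$Terms, you can name the list by its terms and then look up associations by term instead of by numeric index:

# My addition: name the list elements by their terms for lookup by name
names(dat) <- txtTdmBi$dimnames$Terms
dat[["foreign investment"]]  # equivalent to dat[[1005]]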
Incidentally, if your term-document matrix is very large, you may want to try this version of findAssocs:
# u is a term-document matrix
# term is your term
# corlimit is a value from -1 to 1
# (requires the gamlr package for its sparse-friendly corr() function)
findAssocsBig <- function(u, term, corlimit){
  suppressWarnings(x.cor <- gamlr::corr(t(u[ !u$dimnames$Terms == term, ]),
                                        as.matrix(t(u[ u$dimnames$Terms == term, ])) ))
  x <- sort(round(x.cor[(x.cor[, term] > corlimit), ], 2), decreasing = TRUE)
  return(x)
}
This can be used like so:
dat1 <- llply(txtTdmBi$dimnames$Terms, function(i) findAssocsBig(txtTdmBi, i, 0.5), .progress = "text" )
The advantage of this is that it uses a different method of converting the TDM to a matrix than tm::findAssocs does. This method uses memory more efficiently, and so prevents messages like Error: cannot allocate vector of size 1.9 Gb from occurring.
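A rough way to see where that memory goes (my own illustration, not from the original answer) is to compare the in-memory size of the sparse term-document matrix with a fully dense copy of it. On this small crude corpus both are tiny, but on a large corpus the dense copy can be orders of magnitude bigger:

# My own check, not part of the original answer:
format(object.size(txtTdmBi), units = "Mb")             # sparse representation
format(object.size(as.matrix(txtTdmBi)), units = "Mb")  # dense copy; far larger on big corpora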
Quick benchmarking suggests that the two findAssocs functions are about the same speed, so the main difference is in their use of memory:
library(microbenchmark)
microbenchmark(
  dat1 <- llply(txtTdmBi$dimnames$Terms, function(i) findAssocsBig(txtTdmBi, i, 0.5)),
  dat  <- llply(txtTdmBi$dimnames$Terms, function(i) findAssocs(txtTdmBi, i, 0.5)),
  times = 10)
Unit: seconds
                                                                                expr      min       lq   median       uq      max neval
dat1 <- llply(txtTdmBi$dimnames$Terms, function(i) findAssocsBig(txtTdmBi, i, 0.5)) 10.82369 11.03968 11.25492 11.39326 11.89754    10
    dat <- llply(txtTdmBi$dimnames$Terms, function(i) findAssocs(txtTdmBi, i, 0.5)) 10.70980 10.85640 11.14156 11.18877 11.97978    10