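Using "r-project tm parallel" as the search strategy, LMGTFY turned this up as the third hit:
Distributed Text Mining with tm
Copying directly from the slides:
Solution:
1. Distributed storage
   - Data set copied to a DFS ('DistributedCorpus')
   - Only meta information about the corpus remains in memory
2. Parallel computation
   - Computational operations (Map) on all elements in parallel, following the MapReduce paradigm
   - Workhorses: tm_map() and TermDocumentMatrix()
Processed documents (revisions) can be retrieved on demand.
Implemented in a "plugin" package for tm: tm.plugin.dc.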
#Distributed Text Mining in R
> library("tm.plugin.dc")
> dc <- DistributedCorpus(DirSource("Data/reuters"),
list(reader = readReut21578XML) )
> dc <- as.DistributedCorpus(Reuters21578)
> summary(dc)
#A corpus with 21578 text documents
#The metadata consists of 2 tag-value pairs and a data frame
#Available tags are:
#create_date creator
#Available variables in the data frame are:
#--- Distributed Corpus ---
#Available revisions:
#Active revision: 20100417144823
#DistributedCorpus: Storage
#- Description: Local Disk Storage
#- Base directory on storage: /tmp/RtmpuxX3W7/file5bd062c2
#- Current chunk size [bytes]: 10485760
> dc <- tm_map(dc, stemDocument)
> print(object.size(Reuters21578), units = "Mb")
#109.5 Mb
> dc
#A corpus with 21578 text documents
> dc_storage(dc)
#DistributedCorpus: Storage
#- Description: Local Disk Storage
#- Base directory on storage: /tmp/RtmpuxX3W7/file5bd062c2
#- Current chunk size [bytes]: 10485760
> dc[[3]]
#Texas Commerce Bancshares Inc
#s Texas
#Commerce Bank-Houston said it filed an application with the
#Comptroller of the Currency in an effort to create the largest
#banking network in Harris County.
#The bank said the network would link 31 banks having
#13.5 billion dlrs in assets and 7.5 billion dlrs in deposits.
> print(object.size(dc), units = "Mb")
# 0.6 Mb
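The drop from 109.5 Mb to 0.6 Mb follows from the storage idea on the slides: documents live on disk in chunks and only metadata stays in memory. Here is a minimal base-R sketch of that idea. It is not how tm.plugin.dc is implemented, and `corpus_meta`, `map_chunks`, and `get_doc` are hypothetical names made up for this illustration.

```r
# Illustration only: keep documents on disk in chunks, hold just file paths in memory.
docs <- c("Banks filed applications", "The network links banks",
          "Deposits and assets grew", "Filing completed")

store_dir <- tempfile("corpus_")
dir.create(store_dir)

# Split the documents into chunks of two and write each chunk to disk.
chunk_size <- 2
chunk_ids <- ceiling(seq_along(docs) / chunk_size)
chunk_files <- vapply(split(docs, chunk_ids), function(chunk) {
  path <- tempfile("chunk_", tmpdir = store_dir, fileext = ".rds")
  saveRDS(chunk, path)
  path
}, character(1))

# The in-memory "corpus" is metadata only (hypothetical structure).
corpus_meta <- list(n_docs = length(docs), chunks = unname(chunk_files))

# Map step: load one chunk at a time, transform it, and write the result
# to a new file, loosely mirroring the "revision" idea from the slides.
map_chunks <- function(meta, fun) {
  meta$chunks <- vapply(meta$chunks, function(path) {
    out <- sub("\\.rds$", "_rev.rds", path)
    saveRDS(fun(readRDS(path)), out)
    out
  }, character(1), USE.NAMES = FALSE)
  meta
}

lower_meta <- map_chunks(corpus_meta, tolower)

# Retrieve a single processed document on demand.
get_doc <- function(meta, i) {
  chunk <- readRDS(meta$chunks[ceiling(i / chunk_size)])
  chunk[(i - 1) %% chunk_size + 1]
}

get_doc(lower_meta, 3)
```

Because each chunk is processed independently, the loop inside `map_chunks` is the part that could be handed to parallel workers.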
A further search with the terms tm, snow, parLapply ... turned up this link:
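library(snow)  # provides makeCluster, clusterApply, parLapply, snow.time
cl <- makeCluster(4, type="SOCK")
# Dummy task: just sleep; the unused matrix argument still has to be sent to the workers
bigsleep <- function(sleeptime, mat) Sys.sleep(sleeptime)
bigmatrix <- matrix(0, 2000, 2000)
sleeptime <- rep(1, 100)
# clusterApply dispatches the tasks one at a time, shipping the matrix with each task
tm <- snow.time(clusterApply(cl, sleeptime, bigsleep, bigmatrix))
cat(sprintf("Elapsed time for clusterApply: %f\n", tm$elapsed))
# parLapply splits the tasks into one batch per worker, so the matrix is sent far less often
tm <- snow.time(parLapply(cl, sleeptime, bigsleep, bigmatrix))
cat(sprintf("Elapsed time for parLapply: %f\n", tm$elapsed))
stopCluster(cl)  # release the worker processes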
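The timing snippet only sleeps, so here is a minimal sketch of the same `parLapply` call applied to actual text. It assumes the base `parallel` package (which ships with R and provides the snow-style functions) rather than snow itself, and `clean_doc` is a hypothetical stand-in for a tm transformation, not a tm function.

```r
library(parallel)

docs <- c("Texas Commerce Bancshares Inc",
          "The bank said the network would link 31 banks",
          "13.5 billion dlrs in assets")

# Hypothetical per-document cleaning step: lowercase, drop everything
# except letters and spaces, then collapse repeated whitespace.
clean_doc <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z ]", "", x)
  gsub(" +", " ", trimws(x))
}

cl <- makeCluster(2)                        # two local worker processes
cleaned <- parLapply(cl, docs, clean_doc)   # documents are split across the workers
stopCluster(cl)

unlist(cleaned)
```

With only three short documents the cluster start-up cost outweighs any gain; the pattern pays off when the per-document work is expensive or the corpus is large.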