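I need to create a similarity matrix from a document-term matrix so that I can run maximum capture clustering on the documents. So far I have only found solutions that produce a distance matrix. I tried the dist method, but it gives me the wrong output. Is there a way to create a similarity matrix in R? I use the tm package in the code below, but I am not tied to it, so if there is another good package, please let me know. The code so far: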
install.packages("tm")
install.packages("rJava")
install.packages("Snowball")
install.packages("RWeka")
install.packages("RWekajars")
install.packages("XML")
install.packages("openNLP")
install.packages("openNLPmodels.en")
Sys.setenv(NOAWT=TRUE)
library(XML)
library(rJava)
library(Snowball)
library(RWeka)
library(tm)
library(openNLP)
library(openNLPmodels.en)
sample = c(
"cc ee aa",
"dd bb ee",
"bb cc ee dd",
"cc ee dd aa",
"bb ee",
"cc dd aa",
"bb cc aa",
"bb cc",
"cc ee dd"
)
print(sample)
corpus <- Corpus(VectorSource(sample))
inspect(corpus)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument,language="english")
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, tmTagPOS)
inspect(corpus)
dtm <- DocumentTermMatrix(corpus)
inspect(dtm)
# need to create similarity matrix here
dist(dtm, method = "manhattan", diag = FALSE, upper = FALSE)
The output for the given sample should look like this
The similarity matrix is defined as:
if (i < j)
a[i][j] = sim[i][j]
else
a[i][j] = 0
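
One possible way to get a matrix of that shape, as a minimal sketch rather than a definitive answer: the question does not say which measure sim[i][j] is, so the code below assumes cosine similarity between the term-count rows. The names docs, terms, m, norms, sim and a are illustrative. The count matrix is built in base R so the snippet runs on its own; with tm, as.matrix(dtm) should give an equivalent matrix.

```r
# Sample documents from the question
docs <- c("cc ee aa", "dd bb ee", "bb cc ee dd", "cc ee dd aa",
          "bb ee", "cc dd aa", "bb cc aa", "bb cc", "cc ee dd")

# Build a plain document-term count matrix (rows = documents, columns = terms)
tokens <- strsplit(docs, " ", fixed = TRUE)
terms  <- sort(unique(unlist(tokens)))
m <- t(vapply(tokens,
              function(tk) tabulate(match(tk, terms), nbins = length(terms)),
              integer(length(terms))))
dimnames(m) <- list(seq_along(docs), terms)

# ASSUMPTION: sim[i][j] is cosine similarity; the question leaves it undefined
norms <- sqrt(rowSums(m^2))
sim   <- tcrossprod(m) / outer(norms, norms)

# Keep sim[i][j] only where i < j and set everything else to 0
a <- sim
a[lower.tri(a, diag = TRUE)] <- 0

print(round(a, 2))
```

If a different measure is intended, only the two lines that compute sim need to change. The proxy package's simil() function may also be worth a look, since it computes similarities directly instead of distances.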