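I need to create a similarity matrix from a document-term matrix so that I can run Maximum Capturing clustering on the documents. So far I have only found solutions that produce distance matrices. I tried the dist method, but it gives me the wrong output. Is there a way to create a similarity matrix in R? I used the tm package in the code below, but I am not tied to it; if there is another good package, please let me know. My code so far: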

install.packages("tm")
install.packages("rJava")
install.packages("Snowball")
install.packages("RWeka")
install.packages("RWekajars")
install.packages("XML")
install.packages("openNLP")
install.packages("openNLPmodels.en")

Sys.setenv(NOAWT=TRUE)

library(XML)
library(rJava)
library(Snowball)
library(RWeka)
library(tm)
library(openNLP)
library(openNLPmodels.en)

sample = c(
"cc ee aa", 
"dd bb ee",   
"bb cc ee dd",
"cc ee dd aa",
"bb ee",
"cc dd aa",
"bb cc aa",
"bb cc",
"cc ee dd"
)
print(sample)
corpus <- Corpus(VectorSource(sample))
inspect(corpus)

corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument,language="english")
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, tmTagPOS)
inspect(corpus)

dtm <- DocumentTermMatrix(corpus)
inspect(dtm)

# need to create similarity matrix here
dist(dtm, method = "manhattan", diag = FALSE, upper = FALSE)

The output for the given sample should look like this:

(Image: similarity matrix)

The similarity matrix is defined as:

if (i < j) 
    a[i][j] = sim[i][j] 
else 
    a[i][j] = 0 
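
To make the definition above concrete, here is a minimal base R sketch of the masking step. The question does not say how sim[i][j] is computed, so the shared-term count used here is only an assumption; any symmetric similarity (cosine, for example) could be substituted. The names docs, tokens, terms, m, sim and a are illustrative. The matrix is built directly from the strings rather than through tm, partly because tm's default word-length filter may drop two-letter terms such as "cc" unless the wordLengths control option is lowered.

```r
# Sample documents from the question.
docs <- c("cc ee aa", "dd bb ee", "bb cc ee dd", "cc ee dd aa",
          "bb ee", "cc dd aa", "bb cc aa", "bb cc", "cc ee dd")

# Build a binary document-term incidence matrix in base R
# (rows = documents, columns = terms).
tokens <- strsplit(docs, " ", fixed = TRUE)
terms  <- sort(unique(unlist(tokens)))
m <- t(sapply(tokens, function(tok) as.integer(terms %in% tok)))
colnames(m) <- terms

# ASSUMPTION: similarity = number of terms two documents share.
# The question does not define sim, so swap in another measure if needed.
sim <- m %*% t(m)

# Keep sim[i, j] only where i < j; zero the diagonal and the lower triangle.
a <- sim
a[lower.tri(a, diag = TRUE)] <- 0
print(a)
```

If the matrix comes from tm instead, as.matrix(dtm) > 0 should give an equivalent incidence matrix to use in place of m, provided the short terms survive tokenization.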