
I'm trying to cluster similar documents using R. As a first step, I compute the term-document matrix for my set of documents. Then I create the latent semantic space from that term-document matrix. I decided to use LSA in my experiment because the results of clustering on the raw term-document matrix were awful. Is it possible to build a dissimilarity matrix (with the cosine measure) from the LSA space? I need this because the clustering algorithm I'm using requires a dissimilarity matrix as input.

Here is my code:

library(cluster)
library(lsa)

myMatrix <- textmatrix("/home/user/DocmentsDirectory")
myLSAspace <- lsa(myMatrix, dims = dimcalc_share())

I need to build a dissimilarity matrix (using the cosine measure) from the LSA space, so that I can call the clustering algorithm as follows:

clusters <- pam(dissimilarityMatrix, 10, diss = TRUE)

Any suggestions?

Thanks in advance!


2 Answers


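To compare two documents in LSA space, you can take the cross product of the $sk and $dk matrices returned by lsa() to get all the documents in the lower-dimensional LSA space. Here is what I did: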

lsaSpace <- lsa(termDocMatrix)

# lsaMatrix now is a k x (num doc) matrix, in k-dimensional LSA space
lsaMatrix <- diag(lsaSpace$sk) %*% t(lsaSpace$dk)

# Use the `cosine` function in `lsa` package to get cosine similarities matrix
# (subtract from 1 to get dissimilarity matrix)
distMatrix <- 1 - cosine(lsaMatrix)

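See http://en.wikipedia.org/wiki/Latent_semantic_analysis, which says that you can now use the LSA results to "see how related documents j and q are in the low-dimensional space by comparing the vectors sk*d_j and sk*d_q (typically by cosine similarity)."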
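To make the comparison of sk*d_j and sk*d_q concrete, here is a minimal base R sketch of the same pipeline. It uses svd() as a stand-in for lsa(), a made-up 5 x 4 term-document matrix, and an arbitrary choice of k = 2 dimensions; none of these come from the original question. The names tdm, docs, cosDiss and dissimilarityMatrix are illustrative only.

```r
# Toy term-document matrix (made-up counts): rows are terms, columns are documents.
tdm <- matrix(
  c(2, 1, 0, 0,
    1, 2, 0, 0,
    0, 0, 3, 1,
    0, 0, 1, 2,
    1, 0, 0, 1),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("term", 1:5), paste0("doc", 1:4))
)

k <- 2                            # latent dimensions to keep (arbitrary for this sketch)
s <- svd(tdm)                     # tdm = U %*% diag(d) %*% t(V)
sk <- s$d[1:k]                    # top-k singular values
dk <- s$v[, 1:k, drop = FALSE]    # one row per document

# k x (num docs) matrix: each column is a document in the reduced space.
docs <- diag(sk, nrow = k) %*% t(dk)
colnames(docs) <- colnames(tdm)

# Cosine similarity between document columns, then 1 - similarity.
norms <- sqrt(colSums(docs^2))
cosSim <- crossprod(docs) / outer(norms, norms)
cosDiss <- 1 - cosSim
cosDiss[cosDiss < 0] <- 0         # clip tiny negative values caused by rounding
diag(cosDiss) <- 0

# pam() accepts a "dist" object when diss = TRUE.
dissimilarityMatrix <- as.dist(cosDiss)

# cluster is a recommended package; skip the clustering step if it is missing.
if (requireNamespace("cluster", quietly = TRUE)) {
  clusters <- cluster::pam(dissimilarityMatrix, k = 2, diss = TRUE)
  print(clusters$clustering)
}
```

With the lsa package, the same as.dist() step should apply to 1 - cosine(lsaMatrix) before passing the result to pam(..., diss = TRUE).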

answered 2013-11-26T17:57:46.480

You can use the arules package; here is an example:

 library(arules)
 dissimilarity(x=matrix(seq(1,10),ncol=2),method='cosine')
          1         2         3         4
2 -4.543479                              
3 -4.811989 -5.231234                    
4 -5.080052 -5.563952 -6.024433          
5 -5.343350 -5.885304 -6.395740 -6.877264
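
For comparison, cosine dissimilarity defined as 1 minus cosine similarity cannot be negative, so the values printed above look suspicious. The arules dissimilarity() function is documented as targeting binary data, which may explain them; that is an assumption and has not been checked against the arules implementation. A minimal base R sketch on the same 5 x 2 matrix, treating rows as the items to compare (the names x, rowNorms and cosDiss are illustrative only):

```r
x <- matrix(seq(1, 10), ncol = 2)   # same 5 x 2 matrix as in the call above

# Cosine similarity between rows: dot products divided by the product of row norms.
rowNorms <- sqrt(rowSums(x^2))
cosSim <- tcrossprod(x) / outer(rowNorms, rowNorms)

# Dissimilarity as a "dist" object (lower triangle only, diagonal dropped).
cosDiss <- as.dist(1 - cosSim)
print(cosDiss, digits = 3)
```

All rows of this matrix point in nearly the same direction, so every pairwise dissimilarity is small and non-negative.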
answered 2013-03-05T17:09:39.507