
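I have a document-feature matrix (DFM). I want to convert it into an LSA object and then compute the cosine similarity between every pair of documents.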

This is the code I am using:

lsa_t2 <- convert(DFM_tfidf, to = "lsa" , omit_empty = TRUE)
t2_lsa_tfidf_cos_sim = sim2(x = lsa_t2, method = "cosine", norm = "l2")

But I get this error:

Error in sim2(x = lsa_t2, method = "cosine", norm = "l2") :
  inherits(x, "matrix") || inherits(x, "Matrix") is not TRUE

To give more context, this is what lsa_t2 looks like:

[image: what lsa_t2 looks like]

Every document contains text (I have checked), and I filtered out the documents without text before cleaning the dfm.

Any idea what is going on?


1 Answer


The error means that sim2 does not accept the lsa object: it expects a base R matrix or a sparse matrix from the Matrix package. However, I'm not sure I understand the question. Why do you want to convert the dfm to the lsa textmatrix format in the first place?
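A minimal base R sketch of why that check can fail. It assumes, as far as I know, that an lsa textmatrix is an ordinary matrix whose class attribute has been replaced by "textmatrix", with terms in rows and documents in columns. The object below is a hand-built mock, not the real output of convert():

```r
# Mock object built by hand; NOT the actual output of convert(..., to = "lsa").
# Assumption: a textmatrix is a plain matrix with class attribute "textmatrix",
# terms in rows and documents in columns.
mock_lsa <- matrix(c(1, 0, 2, 1), nrow = 2,
                   dimnames = list(c("gold", "truck"), c("d1", "d2")))
class(mock_lsa) <- "textmatrix"

# The condition quoted in the error message:
passes_check <- inherits(mock_lsa, "matrix") || inherits(mock_lsa, "Matrix")
passes_check
#> [1] FALSE

# Dropping the class attribute gives back an ordinary matrix.
plain <- unclass(mock_lsa)
inherits(plain, "matrix")
#> [1] TRUE
```

If the rows really are terms, as in this mock, the matrix would also have to be transposed with t() before asking for document similarities, since sim2 compares rows.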

If you want to calculate cosine similarity between texts, you can do this directly in quanteda:

library(quanteda)
library(quanteda.textstats)  # needed for textstat_simil() in quanteda >= 3.0

texts <- c(d1 = "Shipment of gold damaged in a fire",
           d2 = "Delivery of silver arrived in a silver truck",
           d3 = "Shipment of gold arrived in a truck" )

texts_dfm <- dfm(tokens(texts))  # tokenize explicitly; dfm() on raw text is deprecated in quanteda >= 3.0

textstat_simil(texts_dfm, 
               margin = "documents",
               method = "cosine")
#> textstat_simil object; method = "cosine"
#>       d1    d2    d3
#> d1 1.000 0.359 0.714
#> d2 0.359 1.000 0.598
#> d3 0.714 0.598 1.000

If you want to use sim2 from text2vec, you can do so using the same object without converting it first:

library(text2vec)
sim2(x = texts_dfm, method = "cosine", norm = "l2")
#> 3 x 3 sparse Matrix of class "dsCMatrix"
#>           d1        d2        d3
#> d1 1.0000000 0.3585686 0.7142857
#> d2 0.3585686 1.0000000 0.5976143
#> d3 0.7142857 0.5976143 1.0000000

As you can see, the results are the same.
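To see where these numbers come from, here is a rough base R sketch that computes the same cosine similarities by hand. The whitespace tokenizer and the helper names (tokens_list, vocab, counts) are my own simplifications for this toy example, not what quanteda or text2vec do internally:

```r
texts <- c(d1 = "Shipment of gold damaged in a fire",
           d2 = "Delivery of silver arrived in a silver truck",
           d3 = "Shipment of gold arrived in a truck")

# Naive tokenization: lower-case and split on single spaces.
tokens_list <- strsplit(tolower(texts), " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens_list)))

# Document-term count matrix: one row per document, one column per term.
counts <- t(vapply(tokens_list,
                   function(tok) as.numeric(table(factor(tok, levels = vocab))),
                   numeric(length(vocab))))
colnames(counts) <- vocab

# Cosine similarity = dot products of L2-normalised rows.
unit_rows <- counts / sqrt(rowSums(counts^2))
cos_sim <- tcrossprod(unit_rows)
round(cos_sim, 3)
#>       d1    d2    d3
#> d1 1.000 0.359 0.714
#> d2 0.359 1.000 0.598
#> d3 0.714 0.598 1.000
```

For example, d1 and d2 share only "of", "in" and "a", so their dot product is 3, and the row norms are sqrt(7) and sqrt(10), which gives 3 / sqrt(70), about 0.359.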

Update

Following the comments, I now understand that you want to transform your data with latent semantic analysis (LSA). You can follow the tutorial linked below and plug in the dfm in place of the dtm used in the tutorial:

texts_dfm_tfidf <- dfm_tfidf(texts_dfm)


library(text2vec)
lsa = LSA$new(n_topics = 2)
dtm_tfidf_lsa = fit_transform(texts_dfm_tfidf, lsa) # I get a warning here, probably due to the size of the toy dfm
d1_d2_tfidf_cos_sim = sim2(x = dtm_tfidf_lsa, method = "cosine", norm = "l2")
d1_d2_tfidf_cos_sim
#>              d1           d2        d3           d4
#> d1  1.000000000 -0.002533794 0.5452992  0.999996189
#> d2 -0.002533794  1.000000000 0.8368571 -0.005294431
#> d3  0.545299245  0.836857086 1.0000000  0.542983071
#> d4  0.999996189 -0.005294431 0.5429831  1.000000000

Note that these results differ from run to run unless you use set.seed().

Or if you want to do everything in quanteda:

library(quanteda.textmodels)  # needed for textmodel_lsa() in quanteda >= 2.0
texts_lsa <- textmodel_lsa(texts_dfm_tfidf, nd = 2)

textstat_simil(as.dfm(texts_lsa$docs), 
               margin = "documents",
               method = "cosine")
#> textstat_simil object; method = "cosine"
#>          d1       d2    d3       d4
#> d1  1.00000 -0.00684 0.648  1.00000
#> d2 -0.00684  1.00000 0.757 -0.00894
#> d3  0.64799  0.75720 1.000  0.64638
#> d4  1.00000 -0.00894 0.646  1.00000
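For intuition, the idea behind both variants can be sketched in base R with a plain truncated SVD. This is only an illustration under my own assumptions: the count matrix is hypothetical, and packages may use different solvers and scale the document coordinates differently, so the numbers are not meant to match the output above:

```r
# Hypothetical document-term counts (rows = documents, columns = terms).
dtm <- matrix(c(1, 0, 0, 1,
                0, 2, 1, 0,
                1, 0, 1, 0,
                1, 0, 0, 1),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("d", 1:4),
                              c("gold", "silver", "truck", "fire")))

k <- 2            # number of latent dimensions to keep
dec <- svd(dtm)   # dtm = U %*% diag(d) %*% t(V)

# Document coordinates in the k-dimensional latent space: U_k scaled by d_k.
doc_coords <- dec$u[, 1:k, drop = FALSE] %*% diag(dec$d[1:k], nrow = k)
rownames(doc_coords) <- rownames(dtm)

# Cosine similarity = dot products of L2-normalised rows.
unit_rows <- doc_coords / sqrt(rowSums(doc_coords^2))
lsa_cos_sim <- tcrossprod(unit_rows)
round(lsa_cos_sim, 3)
```

Base R's svd() is deterministic, so this sketch gives the same result on every run. Documents with identical term counts (d1 and d4 here) end up at the same point in the latent space and therefore have cosine similarity 1.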
Answered 2020-02-18T15:16:32.050