The error you receive basically means that the function sim2 does not work with the lsa object. However, I'm not sure I understand the question: why do you want to convert the dfm to the lsa textmatrix format in the first place?
If you want to calculate cosine similarity between texts, you can do this directly in quanteda:
library(quanteda)
library(quanteda.textstats) # textstat_simil() lives in this package since quanteda v3
texts <- c(d1 = "Shipment of gold damaged in a fire",
           d2 = "Delivery of silver arrived in a silver truck",
           d3 = "Shipment of gold arrived in a truck")
texts_dfm <- dfm(tokens(texts)) # tokenize first; recent quanteda versions no longer accept raw text in dfm()
textstat_simil(texts_dfm,
               margin = "documents",
               method = "cosine")
#> textstat_simil object; method = "cosine"
#>       d1    d2    d3
#> d1 1.000 0.359 0.714
#> d2 0.359 1.000 0.598
#> d3 0.714 0.598 1.000
If you want to use sim2 from text2vec, you can do so using the same object without converting it first:
library(text2vec)
sim2(x = texts_dfm, method = "cosine", norm = "l2")
#> 3 x 3 sparse Matrix of class "dsCMatrix"
#>           d1        d2        d3
#> d1 1.0000000 0.3585686 0.7142857
#> d2 0.3585686 1.0000000 0.5976143
#> d3 0.7142857 0.5976143 1.0000000
As you can see, the results are the same.
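To see where these numbers come from, here is a minimal base-R sketch that rebuilds the term counts by hand and computes cosine similarity as the dot product of L2-normalised rows. The whitespace tokenisation and the names counts, normed and cos_sim are my own simplifications for illustration; this is not how quanteda or text2vec implement it internally.

```r
# Same toy texts as above
texts <- c(d1 = "Shipment of gold damaged in a fire",
           d2 = "Delivery of silver arrived in a silver truck",
           d3 = "Shipment of gold arrived in a truck")

# Naive tokenisation: lowercase, then split on single spaces (assumption)
tokens_list <- strsplit(tolower(texts), " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens_list)))

# Documents x terms count matrix
counts <- t(vapply(tokens_list,
                   function(tok) tabulate(match(tok, vocab), nbins = length(vocab)),
                   integer(length(vocab))))
dimnames(counts) <- list(names(texts), vocab)

# Scale every row to unit L2 length, then take all pairwise dot products
normed <- counts / sqrt(rowSums(counts^2))
cos_sim <- tcrossprod(normed)

# d1 and d3 share 5 of their 7 terms, so their similarity is 5 / 7 (about 0.714)
round(cos_sim, 3)
```

For example, d1 and d2 share only "of", "in" and "a", which gives 3 / (sqrt(7) * sqrt(10)), roughly 0.359, in line with both outputs above.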
Update
Based on the comments, I now understand that you want to transform your data via latent semantic analysis (LSA). You can follow the tutorial linked below and plug in the dfm in place of the dtm used in the tutorial:
texts_dfm_tfidf <- dfm_tfidf(texts_dfm)
library(text2vec)
lsa = LSA$new(n_topics = 2)
dtm_tfidf_lsa = fit_transform(texts_dfm_tfidf, lsa) # I get a warning here, probably due to the size of the toy dfm
d1_d2_tfidf_cos_sim = sim2(x = dtm_tfidf_lsa, method = "cosine", norm = "l2")
d1_d2_tfidf_cos_sim
#>              d1           d2        d3           d4
#> d1  1.000000000 -0.002533794 0.5452992  0.999996189
#> d2 -0.002533794  1.000000000 0.8368571 -0.005294431
#> d3  0.545299245  0.836857086 1.0000000  0.542983071
#> d4  0.999996189 -0.005294431 0.5429831  1.000000000
Note that these results differ from run to run unless you use set.seed().
Also note that the similarity matrices in this update contain a fourth document, d4, which is not part of the three-text example above, so you will not reproduce these exact numbers from it.
Or, if you want to do everything in quanteda:
library(quanteda.textmodels) # textmodel_lsa() lives in this package since quanteda v2
texts_lsa <- textmodel_lsa(texts_dfm_tfidf, nd = 2)
textstat_simil(as.dfm(texts_lsa$docs),
               margin = "documents",
               method = "cosine")
#> textstat_simil object; method = "cosine"
#>          d1       d2    d3       d4
#> d1  1.00000 -0.00684 0.648  1.00000
#> d2 -0.00684  1.00000 0.757 -0.00894
#> d3  0.64799  0.75720 1.000  0.64638
#> d4  1.00000 -0.00894 0.646  1.00000
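For intuition about what the LSA step does, here is a rough base-R sketch: a truncated SVD of the documents x terms matrix, followed by cosine similarity on the reduced document coordinates. It is only a sketch under assumptions of mine: it uses raw counts of the three toy texts rather than tf-idf weights, the names counts, doc_coords and lsa_sim are made up for illustration, and scaling the document coordinates by the singular values is just one possible convention. Packages differ in that scaling, so do not expect it to reproduce either output above.

```r
# Documents x terms counts for the three toy texts (raw counts, no tf-idf)
counts <- matrix(
  c(1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0,
    1, 1, 0, 1, 0, 0, 1, 1, 0, 2, 1,
    1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("d1", "d2", "d3"),
                  c("a", "arrived", "damaged", "delivery", "fire", "gold",
                    "in", "of", "shipment", "silver", "truck"))
)

k <- 2                               # number of latent dimensions to keep
dec <- svd(counts, nu = k, nv = k)   # keep only the first k singular vectors

# Document coordinates in the k-dimensional latent space
# (scaling by the singular values is an assumption, see above)
doc_coords <- dec$u %*% diag(dec$d[seq_len(k)], nrow = k)
rownames(doc_coords) <- rownames(counts)

# Cosine similarity between documents in the reduced space
normed <- doc_coords / sqrt(rowSums(doc_coords^2))
lsa_sim <- tcrossprod(normed)
round(lsa_sim, 3)
```

Because the similarities are computed after discarding all but k dimensions, they will generally differ from the plain cosine similarities of the full count matrix.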