
I have trained (fit and transformed) an SVD model on 400 documents as part of my effort to build an LSA model. Here is my code:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline

tfidf_vectorizer = TfidfVectorizer(stop_words='english', use_idf=True, smooth_idf=True)
svd_model = TruncatedSVD(n_components=100, n_iter=10)
lsa_pipeline = Pipeline([('tfidf', tfidf_vectorizer), ('svd', svd_model)])
lsa_model = lsa_pipeline.fit_transform(all_docs)

Now I want to measure the similarity of two sentences (whether from the same document collection or entirely new ones), so I need to transform these two sentences into vectors. I want to do the transformation my own way, which means I need the vector of each word in a sentence.

How can I find the vector of a word using the lsa_model I have already trained?

And, more broadly speaking, does it make sense to build an LSA model from a collection of documents and then use the same model to measure the similarity of sentences from that same collection?


1 Answer


You're almost there; you just need to transform the sentence into a vector. Note that transform expects an iterable of documents, so wrap the sentence in a list:

sentence_vector = lsa_pipeline.transform([sentence])

Then find the distance between the sentence vector and the document matrix using any metric of your choice:

from sklearn.metrics import pairwise_distances
dist_per_doc_matrix = pairwise_distances(sentence_vector, lsa_model, metric='euclidean')

Similarly, you can also take the cosine similarity of two sentence vectors.
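Putting the pieces together, here is a minimal self-contained sketch of that idea. The toy corpus and sentences are placeholders standing in for the question's 400 documents; the pipeline settings mirror the question except for a smaller n_components so it runs on a tiny vocabulary:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for the question's 400 documents.
docs = [
    "the cat sat on the mat",
    "dogs and cats are common pets",
    "the stock market fell sharply today",
    "investors sold their shares on the market",
]

# Same pipeline shape as the question, with a small n_components for the toy data.
lsa_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", use_idf=True, smooth_idf=True)),
    ("svd", TruncatedSVD(n_components=2, random_state=0)),
])
lsa_pipeline.fit(docs)

# transform expects an iterable of documents, hence the single-element lists.
vec_a = lsa_pipeline.transform(["cats like to sit on mats"])
vec_b = lsa_pipeline.transform(["share prices dropped on the market"])

# Cosine similarity between the two sentence vectors, in [-1, 1].
sim = cosine_similarity(vec_a, vec_b)[0, 0]
print(sim)
```

Sentences about the same topic as a document cluster should land near it in the reduced LSA space, which is why reusing the trained pipeline on new sentences is reasonable as long as their vocabulary overlaps the training corpus.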

Reference

Answered 2018-06-18T19:50:17.137