I'm currently trying to implement LSA with scikit-learn to find synonyms across multiple documents. Here is my code:
#import the essential tools for lsa
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity
#other imports
from os import listdir
#load data
datafolder = 'data/'
filenames = []
for file in listdir(datafolder):
    if file.endswith(".txt"):
        filenames.append(datafolder+file)
#Document-Term Matrix
cv = CountVectorizer(input='filename',strip_accents='ascii')
dtMatrix = cv.fit_transform(filenames).toarray()
print dtMatrix.shape
featurenames = cv.get_feature_names()
print featurenames
#Tf-idf Transformation
tfidf = TfidfTransformer()
tfidfMatrix = tfidf.fit_transform(dtMatrix).toarray()
print tfidfMatrix.shape
#SVD
#n_components is recommended to be 100 by Sklearn Documentation for LSA
#http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
svd = TruncatedSVD(n_components = 100)
svdMatrix = svd.fit_transform(tfidfMatrix)
print svdMatrix
#Cosine-Similarity
#cosine = cosine_similarity(svdMatrix[1], svdMatrix)
Now here is my problem: the document-term matrix and the tf-idf matrix have the same shape, (27, 3099): 27 documents and 3099 words. After the singular value decomposition, the matrix has shape (27, 27). I know you can calculate the cosine similarity between two rows to get their similarity, but I don't think I can get the similarity of two words in my documents by doing that with the SVD matrix.
Can someone explain what the SVD matrix represents and how I can use it to find synonyms in my documents?
Thanks in advance.
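For reference, here is a minimal sketch of one possible direction, not a definitive answer. It relies on the fact that `fit_transform` returns one row per document, while the fitted `components_` attribute has one column per term, so transposing it gives one vector per word. The toy corpus, `n_components=2`, and the helper `most_similar` are all made up for illustration, and scaling by `singular_values_` is one common choice rather than a requirement.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus, used only for illustration (not the 27 files above)
docs = [
    "the car drives on the road",
    "the automobile drives on the street",
    "the cat sleeps on the sofa",
    "the kitten sleeps on the couch",
]

cv = CountVectorizer()
counts = cv.fit_transform(docs)                    # (n_docs, n_terms), sparse
tfidf = TfidfTransformer().fit_transform(counts)   # same shape as counts

# n_components=2 is an arbitrary choice for this tiny corpus
svd = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = svd.fit_transform(tfidf)             # (n_docs, k): one row per document

# components_ has shape (k, n_terms); transposing gives one row per term.
# Multiplying by the singular values weights each latent dimension.
term_vectors = svd.components_.T * svd.singular_values_   # (n_terms, k)

# Pairwise cosine similarity between all term vectors
term_sim = cosine_similarity(term_vectors)         # (n_terms, n_terms)

# vocabulary_ maps each term to its column index; sort terms by that index
terms = sorted(cv.vocabulary_, key=cv.vocabulary_.get)

def most_similar(word, top_n=3):
    """Return the top_n terms closest to `word` in the latent space."""
    idx = cv.vocabulary_[word]
    order = np.argsort(-term_sim[idx])             # highest similarity first
    return [(terms[j], float(term_sim[idx, j])) for j in order if j != idx][:top_n]

print(most_similar("car"))
```

With only a handful of documents the latent space has very few dimensions, so the rankings from a corpus this small should be read as a shape check rather than as meaningful synonyms.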