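This is my first attempt at natural language processing, so I started with latent semantic analysis and followed this tutorial to build the algorithm. After testing it, I found that it only classifies the first semantic word and repeats the same words over and over across the other documents.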

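I also tried feeding it the files found HERE, and it behaves exactly the same way: the values of one topic are repeated many times across the other topics.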

Can anyone help explain what is happening? I have been searching, and everything seems to be exactly the same as in the tutorial.

    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer as tfdv

    testDocs = [
        "The Neatest Little Guide to Stock Market Investing",
        "Investing For Dummies, 4th Edition",
        "The Little Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of Stock Market Returns",
        "The Little Book of Value Investing",
        "Value Investing: From Graham to Buffett and Beyond",
        "Rich Dad's Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not!",
        "Investing in Real Estate, 5th Edition",
        "Stock Investing For Dummies",
        "Rich Dad's Advisors: The ABC's of Real Estate Investing: The Secrets of Finding Hidden Profits Most Investors Miss",
    ]
    stopwords = ['and','edition','for','in','little','of','the','to']
    ignorechars = ''',:'!'''

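    #First we apply the standard SKLearn algorithm to compare with.
    #tokens.append(tokenizer.tokenize(element.lower()))
    #Lowercase every document. Strings are immutable, so build a new list
    #instead of reassigning the loop variable.
    testDocs = [element.lower() for element in testDocs]

    print(testDocs)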

    #Vectorize the features.
    vectorizer = tfdv(max_df=0.5, min_df=2, max_features=8, stop_words='english', use_idf=True)#, ngram_range=(1,3))
    #Store the values in matrix X.
    X = vectorizer.fit_transform(testDocs)
    #Apply LSA.
    lsa = TruncatedSVD(n_components=3, n_iter=100)
    lsa.fit(X)

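    #Get a list of the terms in the order they were decomposed.
    #Note: get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
    terms = vectorizer.get_feature_names_out().tolist()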
    print("Terms decomposed from the document: " + str(terms))
    print()

    #Prints the matrix of concepts. Each number represents how important the term is to the concept and the position relates to the position of the term.
    print("Number of components in element 0 of matrix of components:")
    print(lsa.components_[0])
    print("Shape: " + str(lsa.components_.shape))
    print()
    for i, comp in enumerate(lsa.components_):
        #Stick each of the terms to the respective components. Zip command creates a tuple from 2 components.
        termsInComp = zip(terms, comp)
        #Sort the terms according to...
        sortedTerms = sorted(termsInComp, key=lambda x: x[1], reverse=True)
        print("Concept %d:" % i)
        for term in sortedTerms:
            print(term[0], end="\t")
        print()
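
The loop above prints every term for every concept, so each concept lists the same vocabulary and differs only in the order. As a rough way to inspect what separates the concepts, the sketch below keeps only the few terms with the largest absolute weight and prints the weights next to them. The helper `top_terms` and the demo values are hypothetical and made up for illustration; with the code above, the inputs would be `terms` and each row of `lsa.components_`.

```python
def top_terms(terms, weights, n=3):
    """Return the n (term, weight) pairs with the largest absolute weight."""
    paired = zip(terms, weights)
    # Rank by magnitude: a strongly negative weight is as informative as a positive one.
    ranked = sorted(paired, key=lambda tw: abs(tw[1]), reverse=True)
    return ranked[:n]


# Hypothetical vocabulary and component weights, invented for this example only.
demo_terms = ["book", "dummies", "estate", "guide"]
demo_components = [
    [0.10, 0.70, -0.05, 0.60],
    [0.20, -0.10, 0.90, 0.05],
]

for i, comp in enumerate(demo_components):
    print("Concept %d:" % i, top_terms(demo_terms, comp, n=2))
```

Ranking by absolute value is a design choice, not something the tutorial prescribes; sorting by the signed weight, as the original loop does, pushes strongly negative terms to the end of the list.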