All,

This is a repost of something I wrote in a reply in this thread. When I try to print LSI topics in gensim, I get completely wrong results. Here is my code:

try:
    from gensim import corpora, models
except ImportError as err:
    print(err)

class LSI:
    def topics(self, corpus):
        tfidf = models.TfidfModel(corpus)
        corpus_tfidf = tfidf[corpus]
        dictionary = corpora.Dictionary(corpus)
        lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=5)
        print(lsi.show_topics())

if __name__ == '__main__':
    data = '../data/data.txt'
    corpus = corpora.textcorpus.TextCorpus(data)
    LSI().topics(corpus)

This prints the following to the console:

-0.804*"(5, 1)" + -0.246*"(856, 1)" + -0.227*"(145, 1)" + ......

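I would like to print the topics the way @2er0 did here, but this is what I get instead. Note that the second item in each term of the output above is a tuple, and I have no idea where it comes from. data.txt is a text file containing several paragraphs, nothing more.

Any thoughts on this would be great! Adam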

2 Answers

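To answer why your LSI topics are tuples instead of words, check your input corpus.

Was it created from a list of documents converted to a corpus with corpus = [dictionary.doc2bow(text) for text in texts]?

If not, and you are only reading it from a serialized corpus without also loading the dictionary, you will not get words in the topic output.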
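One way to see how tuples could end up where words are expected is the toy sketch below. It is not gensim's implementation: `build_id2word` is a hypothetical stand-in that simply assigns an id to every distinct item it finds in each document. Under that assumption, building the mapping from tokenized texts gives words, while building it from an already-converted bag-of-words corpus (as `corpora.Dictionary(corpus)` does in the question) gives `(id, count)` pairs as the "words". Real gensim versions may behave differently or raise an error instead.

```python
# Toy stand-in for an id-to-token mapping; NOT gensim's Dictionary.
def build_id2word(documents):
    """Assign an integer id to every distinct item seen in the documents."""
    token2id = {}
    for doc in documents:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return {idx: token for token, idx in token2id.items()}

# Tokenized texts: the mapping is built from real words.
texts = [["human", "interface"], ["user", "interface"]]
id2word_ok = build_id2word(texts)
word2id = {word: idx for idx, word in id2word_ok.items()}

# Bag-of-words corpus: each document is a list of (id, count) pairs.
bow_corpus = [sorted((word2id[w], doc.count(w)) for w in set(doc)) for doc in texts]

# Building the mapping from the bag-of-words corpus treats the pairs as tokens.
id2word_bad = build_id2word(bow_corpus)

print(id2word_ok)   # ids map to words
print(id2word_bad)  # ids map to (id, count) tuples
```

If this is what happens in the question's code, passing the dictionary that belongs to the corpus as id2word (for a TextCorpus, its dictionary attribute) rather than building a new one from the corpus would be the thing to try; that suggestion is untested here.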

My code below works and prints the topics with weighted words:

import gensim as gs

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

texts = [[word for word in document.lower().split()] for document in documents]
dictionary = gs.corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

tfidf = gs.models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]

lsi = gs.models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=5)
lsi.print_topics()

for i in lsi.print_topics():
    print(i)

Output of the above:

-0.331*"system" + -0.329*"a" + -0.329*"survey" + -0.241*"user" + -0.234*"minors" + -0.217*"opinion" + -0.215*"eps" + -0.212*"graph" + -0.205*"response" + -0.205*"time"
-0.330*"minors" + 0.313*"eps" + 0.301*"system" + -0.288*"graph" + -0.274*"a" + -0.274*"survey" + 0.268*"management" + 0.262*"interface" + 0.208*"human" + 0.189*"engineering"
0.282*"trees" + 0.267*"the" + 0.236*"in" + 0.236*"paths" + 0.236*"intersection" + -0.233*"time" + -0.233*"response" + 0.202*"generation" + 0.202*"unordered" + 0.202*"binary"
-0.247*"generation" + -0.247*"unordered" + -0.247*"random" + -0.247*"binary" + 0.219*"minors" + -0.214*"the" + -0.214*"to" + -0.214*"error" + -0.214*"perceived" + -0.214*"relation"
0.333*"machine" + 0.333*"for" + 0.333*"lab" + 0.333*"abc" + 0.333*"applications" + 0.258*"computer" + -0.214*"system" + -0.194*"eps" + -0.191*"and" + -0.188*"testing"
answered 2013-03-12T02:34:07.433

It looks ugly, but this gets the job done (a purely string-based approach):

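# x = lsi.show_topics()
x = '-0.804*"(5, 1)" + -0.246*"(856, 1)" + -0.227*"(145, 1)"'
# Split each term into (weight, (first, second)) using plain string operations.
y = [(j.split("*")[0], (j.split("*")[1].split(",")[0].lstrip('"('), j.split("*")[1].split(",")[1].strip().rstrip(')"'))) for j in x.strip().split(" + ")]

# Print one parsed term per line (the original printed the whole list each time).
for i in y:
    print(i)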

Output of the above:

('-0.804', ('5', '1'))
('-0.246', ('856', '1'))
('-0.227', ('145', '1'))
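The same string-based idea can be sketched with a regular expression, which may be easier to read than the chained splits. This is only an illustration: `TERM` and `parse_topic` are hypothetical names, and the pattern assumes every term has the form weight*"token" as in the outputs shown in this thread. It keeps the quoted token as a single string instead of splitting the tuple apart, so it also handles topics whose tokens are ordinary words.

```python
import re

# One term looks like -0.804*"(5, 1)": a signed number, '*', then a quoted token.
TERM = re.compile(r'(-?\d+(?:\.\d+)?)\*"([^"]*)"')

def parse_topic(topic):
    """Return a list of (weight, token) pairs parsed from a topic string."""
    return [(float(weight), token) for weight, token in TERM.findall(topic)]

x = '-0.804*"(5, 1)" + -0.246*"(856, 1)" + -0.227*"(145, 1)"'
for weight, token in parse_topic(x):
    print(weight, token)
```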

If that does not help, you can try lsi.print_topic(i) instead of lsi.show_topics():

for i in range(len(lsi.show_topics())):
    print(lsi.print_topic(i))
answered 2013-03-07T13:46:37.920