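I am using gensim for some NLP tasks. I've created a corpus with dictionary.doc2bow, where dictionary is an object of corpora.Dictionary. Now I want to filter out the terms with low tf-idf values before running an LDA model. I looked through the documentation of the corpus class but could not find a way to access the terms. Any ideas? Thank you.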
4 Answers
5
Assuming your corpus is built as follows:
corpus = [dictionary.doc2bow(doc) for doc in documents]
After running TF-IDF, you can retrieve the list of low-value words:
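from gensim.models import TfidfModel

tfidf = TfidfModel(corpus, id2word=dictionary)
low_value = 0.2
low_value_words = []
for bow in corpus:
    # Collect the ids of tokens whose tf-idf weight is below the threshold
    low_value_words += [id for id, value in tfidf[bow] if value < low_value]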
Then filter them out of the dictionary before running LDA:
dictionary.filter_tokens(bad_ids=low_value_words)
Now recompute the corpus, with the low-value words filtered out:
new_corpus = [dictionary.doc2bow(doc) for doc in documents]
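One thing worth knowing about this approach: the ids are collected across all documents, so a token that scores low in any single document ends up in the list and is removed from the dictionary as a whole. A minimal sketch of that effect, using hand-written (id, weight) lists as a stand-in for the tfidf[bow] output (the weights are made up for illustration):

```python
# Hypothetical tf-idf vectors, shaped like gensim's tfidf[bow] output:
# one list of (token_id, weight) pairs per document.
tfidf_vectors = [
    [(0, 0.10), (1, 0.70), (2, 0.70)],
    [(0, 0.90), (3, 0.15), (4, 0.40)],
]

low_value = 0.2

# Collect ids that fall below the threshold in at least one document.
# A set avoids the duplicate ids that a growing list would accumulate.
low_value_ids = set()
for vector in tfidf_vectors:
    low_value_ids.update(tid for tid, weight in vector if weight < low_value)

# Token 0 scores 0.90 in the second document, but it is still flagged
# because it scores 0.10 in the first one.
print(sorted(low_value_ids))
```

If removing such a token everywhere is too aggressive, filtering each document's bag-of-words separately is the gentler option.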
answered 2016-03-11T22:37:52.497
3
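This is basically the same as the previous answers, but it additionally handles words that are missing from the tf-idf representation because they score 0 (terms that occur in every document). The previous answers did not filter these terms, so they still appeared in the final corpus.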
from gensim import corpora, models

# Same as before
dictionary = corpora.Dictionary(doc_list)
corpus = [dictionary.doc2bow(doc) for doc in doc_list]
tfidf = models.TfidfModel(corpus, id2word=dictionary)

# Filter low-value words and also words missing from the tf-idf model.
low_value = 0.025
for i in range(len(corpus)):
    bow = corpus[i]
    tfidf_ids = [id for id, value in tfidf[bow]]
    bow_ids = [id for id, value in bow]
    low_value_words = [id for id, value in tfidf[bow] if value < low_value]
    # Words with a tf-idf score of 0 are missing from tfidf[bow]
    words_missing_in_tfidf = [id for id in bow_ids if id not in tfidf_ids]
    new_bow = [b for b in bow if b[0] not in low_value_words and b[0] not in words_missing_in_tfidf]
    # Reassign
    corpus[i] = new_bow
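For context on why such terms go missing: as far as I know, gensim's default global weight is idf = log2(N / df), which is exactly 0 for a term that occurs in all N documents, and zero-weight entries are left out of the transformed vector. A small sketch in plain Python, assuming that default formula (the function name idf and the toy corpus are mine, not gensim's):

```python
import math

# Toy tokenized corpus; "the" occurs in every document.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]

def idf(term, docs):
    """Inverse document frequency, log2(N / df) (assumed gensim default)."""
    df = sum(1 for doc in docs if term in doc)
    return math.log2(len(docs) / df)

for term in ("the", "cat", "dog"):
    print(term, round(idf(term, docs), 3))
```

The weight for "the" comes out as zero, so a threshold check such as value < low_value never sees it, which is why the missing ids have to be handled separately.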
answered 2018-01-30T14:06:06.053
3
This is an old question, but if you want to filter at the level of each individual document, do something like this:
from gensim import corpora, models

# Same as before
dictionary = corpora.Dictionary(doc_list)
corpus = [dictionary.doc2bow(doc) for doc in doc_list]
tfidf = models.TfidfModel(corpus, id2word=dictionary)

# Filter low-value words, one document at a time
low_value = 0.025
for i in range(len(corpus)):
    bow = corpus[i]
    low_value_words = [id for id, value in tfidf[bow] if value < low_value]
    new_bow = [b for b in bow if b[0] not in low_value_words]
    # Reassign
    corpus[i] = new_bow
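If the goal is to inspect what gets dropped from each document, it can help to map the ids back to tokens. A rough sketch, using hand-written data in place of a real dictionary and tfidf[bow] output; id2token, tfidf_vectors, and removed_tokens are hypothetical names for illustration:

```python
# Hypothetical id -> token mapping (a gensim Dictionary supports the same
# lookup by id) and made-up per-document tf-idf vectors.
id2token = {0: "data", 1: "model", 2: "topic", 3: "word"}
tfidf_vectors = [
    [(0, 0.01), (1, 0.80), (2, 0.60)],
    [(1, 0.02), (3, 0.99)],
]

def removed_tokens(vector, low_value):
    """Return the tokens in one document whose weight is below low_value."""
    return [id2token[tid] for tid, weight in vector if weight < low_value]

low_value = 0.025
for doc_no, vector in enumerate(tfidf_vectors):
    print(doc_no, removed_tokens(vector, low_value))
```

Note that "model" is dropped from the second document only; it stays in the first one, where it scores well above the threshold.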
answered 2017-04-01T16:35:10.857
1
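Say you have a document tfidf_doc generated by gensim's TfidfModel(), with a corresponding bag-of-words document bow_doc, and you want to drop the words whose tf-idf value falls in the lowest cut_percent share of the words in that document. You can call tfidf_filter(tfidf_doc, cut_percent), and it will return a cut version of tfidf_doc: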
def tfidf_filter(tfidf_doc, cut_percent):
    # Nothing to cut in an empty document
    if not tfidf_doc:
        return tfidf_doc
    sorted_by_tfidf = sorted(tfidf_doc, key=lambda tup: tup[1])
    # Clamp the index so that cut_percent == 1.0 stays within bounds
    cut_index = min(int(len(sorted_by_tfidf) * cut_percent), len(sorted_by_tfidf) - 1)
    cut_value = sorted_by_tfidf[cut_index][1]
    # Walk backwards so that pop() does not shift the entries still to be visited
    for i in range(len(tfidf_doc) - 1, -1, -1):
        if tfidf_doc[i][1] < cut_value:
            tfidf_doc.pop(i)
    return tfidf_doc
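Note that tfidf_filter modifies its argument in place. If that is not wanted, a non-mutating variant could look like the sketch below; tfidf_filter_copy is a hypothetical name, and the weights are made up:

```python
def tfidf_filter_copy(tfidf_doc, cut_percent):
    """Return a new list without the lowest-scoring share of entries.

    tfidf_doc is a list of (token_id, weight) pairs; cut_percent is a
    fraction between 0 and 1. The input list is left untouched.
    """
    if not tfidf_doc:
        return []
    weights = sorted(weight for _, weight in tfidf_doc)
    # Clamp the index so cut_percent == 1.0 does not run past the end.
    cut_index = min(int(len(weights) * cut_percent), len(weights) - 1)
    cut_value = weights[cut_index]
    return [(tid, weight) for tid, weight in tfidf_doc if weight >= cut_value]

doc = [(0, 0.1), (1, 0.4), (2, 0.2), (3, 0.8)]
print(tfidf_filter_copy(doc, 0.5))
```

With cut_percent set to 0.5, the two lowest-weighted of the four entries are dropped and the original list keeps all four.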
Then, to filter the document bow_doc by the resulting tfidf_doc, just call filter_bow_by_tfidf(bow_doc, tfidf_doc), and it will return a cut version of bow_doc:
def filter_bow_by_tfidf(bow_doc, tfidf_doc):
    # Both lists are assumed to be sorted by token id. Walk them from the
    # end and drop (in place) every bow entry whose id is absent from tfidf_doc.
    bow_idx = len(bow_doc) - 1
    tfidf_idx = len(tfidf_doc) - 1
    while bow_idx >= 0:
        if tfidf_idx < 0 or bow_doc[bow_idx][0] > tfidf_doc[tfidf_idx][0]:
            # No matching tf-idf entry for this id: remove it
            bow_doc.pop(bow_idx)
            bow_idx -= 1
        elif bow_doc[bow_idx][0] == tfidf_doc[tfidf_idx][0]:
            # Match: keep the entry and move both cursors
            bow_idx -= 1
            tfidf_idx -= 1
        else:
            # tf-idf id that does not occur in bow_doc: skip it
            tfidf_idx -= 1
    return bow_doc
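The backwards two-pointer walk relies on both lists being sorted by token id. If that ordering is not guaranteed, a set-based filter is a simpler alternative. This is only a sketch, with the hypothetical name filter_bow_by_ids, and it returns a new list rather than editing bow_doc in place:

```python
def filter_bow_by_ids(bow_doc, tfidf_doc):
    """Keep only the bow entries whose token id survives in tfidf_doc."""
    keep_ids = {tid for tid, _ in tfidf_doc}
    return [(tid, count) for tid, count in bow_doc if tid in keep_ids]

bow_doc = [(0, 2), (1, 1), (2, 3), (5, 1)]
tfidf_doc = [(1, 0.6), (5, 0.8)]
print(filter_bow_by_ids(bow_doc, tfidf_doc))
```

The set lookup makes each membership test constant time, so the whole filter stays linear in the length of the document.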
answered 2018-10-31T15:48:59.763