python - Vector space model on a large dataset using Python

Question

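I wrote this function, then had to change my tf-idf function (it had a bug). Now when I run the code I no longer get any scores, but it doesn't fail either. I've been debugging for hours. Any ideas?

The inputs are as follows: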

query: dict of titles - {141: ['light', 'brown', 'book'].. (around 7k documents)
doc_dict: dict of documents - {101940009: ['james', 'what',.. (around 7k documents)
tfidf_scr: tf-idf score of the document - {101940009: {}, (101940009, 'alex'): 0.05773409763071451, (101940009, 'watch'): 0.11930283807601677,...

def vectorSpaceModel(query, doc_dict, tfidf_scr):
    query_vocab = {}
    for word in query:
        if word not in query_vocab:
            query_vocab[word] = word
    query_wc = {}
    for word in query_vocab:
        for x in query[word]:
            query_wc[x] = x.lower().split().count(word)
            print(query_wc[x])
    relevance_scores = {}
    for doc_id in doc_dict.keys():
        score = 0
        for word in query_vocab:
            try:
                score += query_wc[word] * tfidf_scr[doc_id][word]
            except KeyError:
                continue
        relevance_scores[doc_id] = round(score, 2)
    return top_100_docs(relevance_scores)

I'm testing on a smaller collection of documents so that I can work through the results, but this is the output:

***** VSM *****
DocumentID      Score
101940009       0
101940010       0
101940017       0
101940020       0
101940027       0
101940036       0
101940037       0
101940038       0
101940039       0
101940040       0
101940042       0
101940043       0
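
A minimal sketch of one possible cause, assuming the input shapes listed above are accurate. Iterating over `query` yields the title ids (such as 141) rather than words, and `tfidf_scr[doc_id]` is an empty dict because the scores sit under `(doc_id, word)` tuple keys. Either way the lookup raises `KeyError`, the `except` swallows it, and every score stays 0. The helper name `vector_space_scores` and the toy data are hypothetical, and `top_100_docs` is left out because its definition is not shown:

```python
from collections import Counter


def vector_space_scores(query_tokens, doc_dict, tfidf_scr):
    """Score every document against ONE tokenized query (hypothetical helper)."""
    # Term frequency of each word in the query itself.
    query_wc = Counter(token.lower() for token in query_tokens)

    relevance_scores = {}
    for doc_id in doc_dict:
        score = 0.0
        for word, count in query_wc.items():
            # Assumption: tf-idf scores are keyed by (doc_id, word) tuples,
            # as in the sample shown for tfidf_scr. Missing pairs count as 0.
            score += count * tfidf_scr.get((doc_id, word), 0.0)
        relevance_scores[doc_id] = round(score, 2)
    return relevance_scores


# Toy data shaped like the inputs described in the question.
doc_dict = {101940009: ['james', 'what', 'alex', 'watch']}
tfidf_scr = {
    101940009: {},
    (101940009, 'alex'): 0.05773409763071451,
    (101940009, 'watch'): 0.11930283807601677,
}
query = {141: ['light', 'brown', 'book'], 142: ['alex', 'watch']}

# Score each query separately instead of passing the whole dict at once.
all_scores = {
    query_id: vector_space_scores(tokens, doc_dict, tfidf_scr)
    for query_id, tokens in query.items()
}
print(all_scores)
```

Using `dict.get` with a default instead of a bare `try/except KeyError` makes a mismatched key shape easier to spot, since a structural mistake no longer looks the same as a word that is simply absent from a document.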
