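I wrote this function, then had to change my tf-idf function because it had a bug. Now when I run the code I no longer get any scores, but it doesn't fail either. I've been debugging for hours. Any ideas?
The inputs are as follows: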
query: dict of titles - {141: ['light', 'brown', 'book'].. (around 7k documents)
doc_dict: dict of documents - {101940009: ['james', 'what',.. (around 7k documents)
tfidf_scr: tf-idf score of the document - {101940009: {}, (101940009, 'alex'): 0.05773409763071451, (101940009, 'watch'): 0.11930283807601677,...
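For reference, here is a minimal sketch of how lookups behave on a dictionary shaped like the `tfidf_scr` sample above. It assumes the scores really are stored under `(doc_id, word)` tuple keys next to an empty per-document dict, as the excerpt suggests. The helper name `lookup_score` is hypothetical and not part of the original code.

```python
# Sample values copied from the excerpt above; the real dict is much larger.
tfidf_scr = {
    101940009: {},
    (101940009, 'alex'): 0.05773409763071451,
    (101940009, 'watch'): 0.11930283807601677,
}

def lookup_score(scores, doc_id, word):
    """Hypothetical helper: read a score stored under a (doc_id, word) key."""
    return scores.get((doc_id, word), 0.0)

# Nested-style access reaches the empty inner dict, so the word lookup
# raises KeyError even though a score for that word exists.
try:
    tfidf_scr[101940009]['alex']
    nested_ok = True
except KeyError:
    nested_ok = False

print(nested_ok)
print(lookup_score(tfidf_scr, 101940009, 'alex'))
```

If the real dictionary has this shape, any `tfidf_scr[doc_id][word]` lookup would raise `KeyError` for every word.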
def vectorSpaceModel(query, doc_dict, tfidf_scr):
    query_vocab = {}
    for word in query:
        if word not in query_vocab:
            query_vocab[word] = word
    query_wc = {}
    for word in query_vocab:
        for x in query[word]:
            query_wc[x] = x.lower().split().count(word)
            print(query_wc[x])
    relevance_scores = {}
    for doc_id in doc_dict.keys():
        score = 0
        for word in query_vocab:
            try:
                score += query_wc[word] * tfidf_scr[doc_id][word]
            except KeyError:
                continue
        relevance_scores[doc_id] = round(score, 2)
    return top_100_docs(relevance_scores)
I'm testing on a smaller document collection so the output is manageable, but this is what I get:
***** VSM *****
DocumentID Score
101940009 0
101940010 0
101940017 0
101940020 0
101940027 0
101940036 0
101940037 0
101940038 0
101940039 0
101940040 0
101940042 0
101940043 0
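All-zero scores like these are consistent with every lookup inside the `try` block raising `KeyError` and being silently skipped. Two things look suspect. `query_wc` is keyed by title tokens but read with the query id, and `tfidf_scr` appears to be keyed by `(doc_id, word)` tuples rather than by nested dicts. Below is a minimal sketch of a variant that avoids both, assuming one tokenized query is scored at a time and the tuple-keyed layout is the real one. The name `score_query` and the sample tokens are illustrative, not from the original code.

```python
from collections import Counter

def score_query(query_tokens, doc_dict, tfidf_scr):
    """Hypothetical variant: score every document against one tokenized query."""
    # Count how often each term appears in the query.
    query_wc = Counter(token.lower() for token in query_tokens)
    relevance_scores = {}
    for doc_id in doc_dict:
        score = 0.0
        for word, count in query_wc.items():
            # Assumes scores are keyed by (doc_id, word); missing pairs add 0.
            score += count * tfidf_scr.get((doc_id, word), 0.0)
        relevance_scores[doc_id] = round(score, 2)
    return relevance_scores

# Illustrative data: the tf-idf values come from the excerpt in the question,
# the document tokens are made up for the example.
doc_dict = {101940009: ['james', 'what', 'alex', 'watch']}
tfidf_scr = {
    101940009: {},
    (101940009, 'alex'): 0.05773409763071451,
    (101940009, 'watch'): 0.11930283807601677,
}
scores = score_query(['alex', 'watch', 'watch'], doc_dict, tfidf_scr)
print(scores)
```

Using `dict.get` with a default instead of a bare `try`/`except KeyError` keeps a key mismatch from hiding behind a row of zeros. A quick check such as printing a few keys of `tfidf_scr` would confirm which layout is in use.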