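I have a complete inverted index in the form of a nested Python dictionary. Its structure is:
{word: {doc_name: [location_list]}}
For example, let the dictionary be called index; then for the word "spam", the entry would look like this:
{'spam': {'doc1.txt': [102, 300, 399], 'doc5.txt': [200, 587]}}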
So the documents containing any given word are given by index[word].keys(), and the frequency of the word in a document by len(index[word][document]).
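To make the two lookups concrete, here is a minimal sketch on a toy index. The data is the made-up "spam" entry from the example, not a real index.

```python
# Toy index in the {word: {doc_name: [location_list]}} layout (made-up data).
index = {
    "spam": {"doc1.txt": [102, 300, 399], "doc5.txt": [200, 587]},
}

word = "spam"
docs = sorted(index[word].keys())      # documents that contain the word
freq = len(index[word]["doc1.txt"])    # occurrences of the word in doc1.txt

print(docs)  # ['doc1.txt', 'doc5.txt']
print(freq)  # 3
```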
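Now my question is: how do I implement a normal query search against this index? That is, given a query containing, let's say, 4 words, find the documents containing all four words (ranked by total frequency of occurrence), then the documents containing 3 of them, and so on.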
**
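I added this code using S. Lott's answer. This is the code I have written. It works exactly the way I want (it only needs some output formatting), but I know it can be improved.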
**
from collections import defaultdict
from operator import itemgetter

# Take input
query = input(" Enter the query : ")

# Some preprocessing
query = query.lower()
query = query.strip()

# Now the real work
wordlist = query.split()
search_words = [x for x in wordlist if x in index]  # words that are present in the index
print("\nsearching for words ... : ", search_words, "\n")

# Map each document to the query words it contains
doc_has_word = [(index[word].keys(), word) for word in search_words]
doc_words = defaultdict(list)
for d, w in doc_has_word:
    for p in d:
        doc_words[p].append(w)

# Create a dictionary identifying matches for each document
result_set = {}
for i in doc_words.keys():
    count = 0
    matches = len(doc_words[i])  # number of matches
    for w in doc_words[i]:
        count += len(index[w][i])  # count total occurrences
    result_set[i] = (matches, count)

# Now print in sorted order: most words matched first, then highest total frequency
print(" Document \t\t Words matched \t\t Total Frequency ")
print('-' * 40)
for doc, (matches, count) in sorted(result_set.items(), key=itemgetter(1), reverse=True):
    print(doc, "\t", doc_words[doc], "\t", count)
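One possible tightening, offered only as a sketch: the same ranking wrapped in a function that takes the index and the query and returns the ranked rows instead of printing them. The function name search, the toy index, and the deduplication of repeated query words are my assumptions, not part of the code above.

```python
from collections import defaultdict


def search(index, query):
    """Rank documents by (number of query words matched, total occurrences)."""
    # Keep only words present in the index; dict.fromkeys drops repeated
    # query words while preserving their order (an assumption of this sketch).
    words = [w for w in dict.fromkeys(query.lower().split()) if w in index]

    # Map each document to the query words it contains.
    doc_words = defaultdict(list)
    for w in words:
        for doc in index[w]:
            doc_words[doc].append(w)

    # One row per document: (doc, matched words, total occurrences).
    rows = [
        (doc, matched, sum(len(index[w][doc]) for w in matched))
        for doc, matched in doc_words.items()
    ]
    # Most words matched first, ties broken by total frequency.
    rows.sort(key=lambda r: (len(r[1]), r[2]), reverse=True)
    return rows


# Hypothetical toy index, for illustration only.
index = {
    "spam": {"doc1.txt": [102, 300, 399], "doc5.txt": [200, 587]},
    "eggs": {"doc5.txt": [7], "doc2.txt": [1, 2]},
}

results = search(index, "Spam eggs ham")
for doc, matched, total in results:
    print(doc, "\t", matched, "\t", total)
```

Returning rows rather than printing keeps the output formatting separate from the ranking, so the table layout can be changed without touching the search logic.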
Please comment. Thanks.