parsing - Efficient boolean search over a large index using Whoosh

Question

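I have built an index with the fields (id, title, url, content) to store web page information gathered by crawling. Now I want to search that index with multi-word queries, including boolean queries. Please suggest good, efficient search algorithms (with some examples) and an efficient way to parse the queries.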

score 0 · Accepted Answer

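Do you want to search only the title, or the content as well? Assuming you want to allow partial (substring) searches on the title that return the URL and/or content, the schema would be: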

from whoosh.fields import Schema, ID, NGRAM, STORED
schema = Schema(id=ID(stored=True), title=NGRAM(minsize=2, maxsize=20, stored=True, sortable=ranking_col), url=STORED(), content=STORED())  # ranking_col is not defined in this snippet

This works with the standard Whoosh searcher for up to roughly 1,000,000 titles. With more entries, the n-gram index becomes very large and slow.
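To get a feel for why the n-gram index grows so quickly, the sketch below counts the character n-grams that a single title produces for a given minsize/maxsize. It is a plain-Python approximation, not Whoosh's own tokenizer, and the helper name `char_ngrams` is made up for illustration.

```python
def char_ngrams(text, minsize=2, maxsize=20):
    """Return every character n-gram of text with minsize <= length <= maxsize."""
    text = text.lower()
    grams = []
    for size in range(minsize, maxsize + 1):
        # the inner range is empty when the text is shorter than size
        for start in range(len(text) - size + 1):
            grams.append(text[start:start + size])
    return grams


# Hypothetical title, used only to show the growth in index terms.
title = "efficient boolean search"
grams = char_ngrams(title)
print(len(title), "characters ->", len(grams), "n-grams")
```

Each title contributes many more index terms than it has words, so lowering maxsize is probably the first thing to try if the index gets too large.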

Also, use stopwords to reduce the index size:

stopwords = {'of', 'by', 'the', 'in', 'for', 'a'}  # words to be excluded from the index

def create_whoosh(self):
    # ix is the Whoosh index created from the schema; documents holds the crawled pages
    writer = ix.writer()
    for doc in documents:
        # drop stopwords from the title before indexing
        words = [w for w in doc.title.split(" ") if w not in stopwords]
        writer.add_document(title=" ".join(words), url=doc.url, content=doc.content)
    writer.commit()

The searcher:

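from whoosh.qparser import QueryParser

def lookup(self, terms):
    with ix.searcher() as src:
        # parse against the indexed "title" field; the schema has no "term" field
        query = QueryParser("title", ix.schema).parse(terms)
        results = src.search(query, limit=30)
        return [[r['url'], r['content']] for r in results]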

If you want to search for whole words in both the title and the content, you can use this schema instead:

from whoosh.fields import Schema, ID, TEXT, STORED
schema = Schema(id=ID(stored=True), title=TEXT(stored=True), url=STORED(), content=TEXT(stored=True))

This does not support substring searches, but it copes well with several million documents (depending on the size of the content).
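As a rough illustration of what whole-word indexing means for boolean queries, the sketch below builds a tiny inverted index in plain Python and evaluates AND, OR and NOT with set operations. This is a conceptual model only, not how Whoosh is implemented. The sample documents and function names are invented.

```python
from collections import defaultdict

# Invented sample documents: id -> text
docs = {
    1: "whoosh is a pure python search library",
    2: "boolean search with python sets",
    3: "crawling web pages with java",
}


def build_index(documents):
    """Map each whole word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


index = build_index(docs)


def ids(word):
    # .get avoids creating empty entries for unknown words
    return index.get(word, set())


both = ids("python") & ids("search")       # python AND search
either = ids("python") | ids("java")       # python OR java
without = ids("search") - ids("whoosh")    # search NOT whoosh
substring = ids("pyth")                    # whole words only, so a prefix finds nothing

print(sorted(both), sorted(either), sorted(without), sorted(substring))
```

Because each lookup is a dictionary access followed by set operations, the cost depends mostly on how many documents contain the queried words rather than on the total size of the collection.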

To index around 10 million documents, you would need to store the content separately in some kind of database and use Whoosh only to look up the IDs.
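One possible shape for that split, sketched with the standard-library `sqlite3` module: the full page content lives in a database table keyed by id, and the search index returns only ids. The table name, column names and the hard-coded list of matching ids are assumptions for illustration. In practice the ids would come from the stored `id` field of the Whoosh hits.

```python
import sqlite3

# In-memory database for the sketch; a real crawl would use a file or a database server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (id TEXT PRIMARY KEY, url TEXT, content TEXT)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?)",
    [
        ("1", "http://example.com/a", "full text of page a"),
        ("2", "http://example.com/b", "full text of page b"),
        ("3", "http://example.com/c", "full text of page c"),
    ],
)


def fetch_pages(conn, doc_ids):
    """Return [url, content] pairs for the given ids, in the order the ids were given."""
    if not doc_ids:
        return []
    placeholders = ",".join("?" for _ in doc_ids)
    rows = conn.execute(
        f"SELECT id, url, content FROM pages WHERE id IN ({placeholders})",
        list(doc_ids),
    ).fetchall()
    by_id = {row[0]: [row[1], row[2]] for row in rows}
    # keep the ranking order of the search results; skip ids missing from the table
    return [by_id[i] for i in doc_ids if i in by_id]


# Assumed search output: ids in ranked order, as an index lookup might return them.
matching_ids = ["3", "1"]
pages = fetch_pages(conn, matching_ids)
print(pages)
```

With this layout the index schema only needs the searchable fields plus a stored id, which keeps the index itself much smaller.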
