python - ElasticSearch: EdgeNgrams and numbers

Question

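Any ideas on how EdgeNgram treats numbers?

I'm running haystack with an ElasticSearch backend. I created an indexed field of type EdgeNgram. This field will contain a string that may include both words and numbers.

When I search this field with a partial word, it works as expected. But when I enter a partial number, I don't get the results I want.

Example:

I search for the indexed field "EdgeNgram 12323" by typing "edgen", and the indexed entry is returned to me. If I search for that same entry by typing "123", I get nothing.

Thoughts?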

score 4 · Accepted Answer

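I found my way here while trying to solve this same problem in Haystack + Elasticsearch. Following the hints from uboness and ComoWhat, I wrote an alternate Haystack engine that (I believe) makes EdgeNGram fields treat numeric strings like words. Others may benefit, so I thought I'd share it.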

from haystack.backends.elasticsearch_backend import ElasticsearchSearchEngine, ElasticsearchSearchBackend

class CustomElasticsearchBackend(ElasticsearchSearchBackend):
    """
    The default ElasticsearchSearchBackend settings don't tokenize strings of digits the same way as words, so emplids
    get lost: the lowercase tokenizer is the culprit. Switching to the standard tokenizer and doing the case-
    insensitivity in the filter seems to do the job.
    """
    def __init__(self, connection_alias, **connection_options):
        # see http://stackoverflow.com/questions/13636419/elasticsearch-edgengrams-and-numbers
        analyzer = self.DEFAULT_SETTINGS['settings']['analysis']['analyzer']['edgengram_analyzer']
        # The standard tokenizer keeps digit strings as tokens; the lowercase tokenizer drops them.
        analyzer['tokenizer'] = 'standard'
        # DEFAULT_SETTINGS is shared at class level, so avoid appending the filter more than once.
        if 'lowercase' not in analyzer['filter']:
            analyzer['filter'].append('lowercase')
        super(CustomElasticsearchBackend, self).__init__(connection_alias, **connection_options)

class CustomElasticsearchSearchEngine(ElasticsearchSearchEngine):
    backend = CustomElasticsearchBackend
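
To make Haystack use an engine like the one above, the connection's ENGINE setting has to point at it. The following is a minimal sketch of a Django settings fragment. The module path `myapp.search_backends`, the URL, and the index name are placeholders assumed for illustration, not values from the original answer.

```python
# settings.py (fragment)
# Assumption: the custom classes live in a hypothetical module
# myapp/search_backends.py; adjust the dotted path to your project.
HAYSTACK_CONNECTIONS = {
    'default': {
        # Dotted path to the custom engine class instead of the stock one.
        'ENGINE': 'myapp.search_backends.CustomElasticsearchSearchEngine',
        # Placeholder Elasticsearch endpoint and index name.
        'URL': 'http://127.0.0.1:9200/',
        'INDEX_NAME': 'haystack',
    },
}
```

Because the analyzer is part of the index settings, an existing index most likely has to be recreated (for example with Haystack's `rebuild_index` management command) before the change affects search results.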

score 3 · Accepted Answer

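If you're using the edgeNGram tokenizer, it will treat "EdgeNGram 12323" as a single token and then apply the edgeNGram'ing process to it. For example, with min_gram=1 and max_gram=4, you'll get the following tokens: ["E", "Ed", "Edg", "Edge"]. So I guess this is not what you're really looking for. Consider using the edgeNGram token filter instead: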

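If you're using the edgeNGram token filter, make sure the tokenizer you use actually splits the text "EdgeNGram 12323" into two tokens: ["EdgeNGram", "12323"] (the standard or whitespace tokenizer will do the trick). Then apply the edgeNGram filter after it.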
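
As a rough illustration of that setup, here is a sketch of index settings expressed as a Python dict. The names `edge_filter` and `edge_analyzer` and the min_gram/max_gram values are assumptions chosen for the example. The filter type is spelled `edgeNGram` as in this answer; newer Elasticsearch releases spell it `edge_ngram`.

```python
# Hypothetical index settings: a standard tokenizer followed by a
# lowercase filter and an edge n-gram token filter.
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                # Illustrative filter name and gram sizes.
                "edge_filter": {
                    "type": "edgeNGram",
                    "min_gram": 1,
                    "max_gram": 15,
                },
            },
            "analyzer": {
                "edge_analyzer": {
                    "type": "custom",
                    # Splits "EdgeNGram 12323" into two tokens and keeps the digits.
                    "tokenizer": "standard",
                    # Lowercase first, then emit prefixes of each token.
                    "filter": ["lowercase", "edge_filter"],
                },
            },
        },
    },
}
```

A dict like this would be passed as the body when creating the index. The exact client call depends on the Elasticsearch client version in use.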

In general, edgeNGram will take "12323" and produce tokens such as "1", "12", "123", etc.
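
The behavior described in this answer can be approximated in plain Python. This is only a simulation for intuition, not Elasticsearch's implementation: `letters_only_tokenize` is a rough stand-in for a tokenizer that splits on non-letters (as the lowercase tokenizer does), and all function names are made up for this sketch.

```python
import re

def edge_ngrams(token, min_gram=1, max_gram=4):
    """Return the prefixes of `token` with lengths min_gram..max_gram."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]

def letters_only_tokenize(text):
    # Approximates a letter-based tokenizer: digits are separators and are dropped.
    return re.findall(r"[^\W\d_]+", text)

def whitespace_tokenize(text):
    # Approximates the whitespace tokenizer: digit strings survive as tokens.
    return text.split()

def index_terms(text, tokenize, min_gram=1, max_gram=15):
    """Tokenize, lowercase, and collect the edge n-grams of every token."""
    terms = set()
    for token in tokenize(text):
        terms.update(edge_ngrams(token.lower(), min_gram, max_gram))
    return terms

text = "EdgeNgram 12323"
letters_terms = index_terms(text, letters_only_tokenize)
whitespace_terms = index_terms(text, whitespace_tokenize)

print("edgen" in letters_terms, "123" in letters_terms)        # True False
print("edgen" in whitespace_terms, "123" in whitespace_terms)  # True True
# Treating the whole string as one token yields only prefixes of the full text.
print(edge_ngrams("EdgeNGram 12323", 1, 4))  # ['E', 'Ed', 'Edg', 'Edge']
```

With the letters-only tokenizer the digits never reach the n-gram step, which matches the symptom in the question: "edgen" matches but "123" does not.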
