django - Xapian search term exceeds 245 characters: InvalidArgumentError: Term too long (> 245)

Question

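I am using Xapian and Haystack in my django application. I have a model with a text field that I want to index for search. The field stores all kinds of content: words, URLs, HTML, and so on.

I am using the default document-based index template: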

text = indexes.CharField(document=True, use_template=True)

When someone pastes a particularly long link, this sometimes raises the following error:

InvalidArgumentError: Term too long (> 245)

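I understand this error. I have already worked around it before, for other fields in other situations.

My question is: what is the preferred way to handle this exception?

Handling it seems to require a prepare_text() method: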

def prepare_text(self, obj):
    content = []      
    for word in obj.body.split(' '):
        if len(word) <= 245:
            content += [word]
    return ' '.join(content)

It looks clunky and error-prone. On top of that, I can no longer use the search template.

How do you handle this?

score 0 · Accepted Answer

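I think you are on the right track. There is a patch on the inkscape xapian_backend fork, inspired by the xapian omega project.

I did something similar to what you did in my project, with a trick that keeps the search index template usable: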

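import re

# Xapian rejects terms longer than 245 bytes, so stay a little below that.
_max_length = 240
# Matches any run of non-whitespace bytes longer than _max_length;
# group 1 is the leading part that is kept.
_regex = re.compile(br"(\S{%d})\S+" % _max_length)

def prepare_text(self, obj):

    # Reuse the text already rendered from the search template.
    field = self.fields["text"]
    text = self.prepared_data[field.index_fieldname]

    # Work on bytes, because the limit applies to the encoded term.
    encoding = "utf8"
    encoded = text.encode(encoding)

    # Cut every over-long run down to its first _max_length bytes.
    prepared = _regex.sub(br"\1", encoded)

    if len(prepared) != len(encoded):
        # 'ignore' drops a multi-byte character that the cut split in half.
        return prepared.decode(encoding, 'ignore')

    return text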
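
To try the truncation outside of Django, here is a minimal standalone sketch of the same idea. The helper name truncate_long_terms and the constant MAX_TERM_BYTES are made up for illustration, and it assumes the over-long terms come from whitespace-delimited tokens, which may not match exactly how your backend splits text into terms.

```python
import re

# Assumed safety margin below Xapian's 245-byte term limit.
MAX_TERM_BYTES = 240

# Any run of non-whitespace bytes longer than MAX_TERM_BYTES;
# group 1 is the prefix that is kept.
_LONG_RUN = re.compile(br"(\S{%d})\S+" % MAX_TERM_BYTES)


def truncate_long_terms(text, encoding="utf-8"):
    """Cut each whitespace-delimited token down to MAX_TERM_BYTES bytes."""
    encoded = text.encode(encoding)
    truncated = _LONG_RUN.sub(br"\1", encoded)
    # "ignore" drops a multi-byte character split by the cut.
    return truncated.decode(encoding, "ignore")


if __name__ == "__main__":
    sample = "see http://example.com/" + "a" * 300 + " for details"
    result = truncate_long_terms(sample)
    # Longest remaining token, measured in encoded bytes.
    print(max(len(word.encode("utf-8")) for word in result.split()))
```

Measuring in bytes rather than characters matters for non-ASCII text, where a token well under 245 characters can still exceed 245 bytes once encoded.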
