python - Partial matching with the GAE Search API

Question

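Is it possible to search for partial matches using the GAE Search API?

I'm trying to build an autocomplete feature where the search term is a partial word. For example:

> b
> bui
> build

would all return "building".

How can this be done with GAE?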

score 31 · Accepted Answer

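Although full-text search does not support the LIKE statement (partial matching), you can work around it.

First, tokenize the data string into all possible substrings (hello = h, he, hel, lo, etc.)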

def tokenize_autocomplete(phrase):
    a = []
    for word in phrase.split():
        j = 1
        while True:
            for i in range(len(word) - j + 1):
                a.append(word[i:i + j])
            if j == len(word):
                break
            j += 1
    return a
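
To get a feel for how much data this produces, here is a rough sketch that counts the tokens for a single word. `all_substrings` is a hypothetical helper, not part of the answer; it is meant as a compact equivalent of the loop above for one word. A word of length n yields n(n+1)/2 substrings, so long names inflate the indexed field quickly.

```python
def all_substrings(word):
    """Return every contiguous substring of word, shortest first (duplicates kept)."""
    return [word[i:i + j]
            for j in range(1, len(word) + 1)
            for i in range(len(word) - j + 1)]


tokens = all_substrings("hello")
print(len(tokens))       # 5 * 6 / 2 = 15 tokens, including the duplicate "l"
print(len(set(tokens)))  # 14 distinct tokens
```

If index size matters, de-duplicating the tokens before joining them is a cheap first step.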

Build an index + document (Search API) using the tokenized strings

from google.appengine.api import search

index = search.Index(name='item_autocomplete')
for item in items:  # each item is an ndb model instance
    # Join all tokens into one string so they fit in a single text field.
    name = ','.join(tokenize_autocomplete(item.name))
    document = search.Document(
        doc_id=item.key.urlsafe(),
        fields=[search.TextField(name='name', value=name)])
    index.put(document)

Perform the search, and voilà!

results = search.Index(name="item_autocomplete").search("name:elo")

https://code.luasoftware.com/tutorials/google-app-engine/partial-search-on-gae-with-search-api/
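
Why this works can be illustrated without App Engine. The sketch below is only an in-memory stand-in: the dict-based `build_index` and the helper `tokenize_substrings` are hypothetical names for illustration and say nothing about how the Search API stores data internally. It shows that once every substring is indexed as its own token, a partial term becomes an exact token lookup.

```python
from collections import defaultdict


def tokenize_substrings(phrase):
    """Return the set of all substrings of every whitespace-separated word."""
    tokens = set()
    for word in phrase.split():
        for start in range(len(word)):
            for end in range(start + 1, len(word) + 1):
                tokens.add(word[start:end])
    return tokens


def build_index(names):
    """Map each token to the set of names that contain it (toy inverted index)."""
    index = defaultdict(set)
    for name in names:
        for token in tokenize_substrings(name.lower()):
            index[token].add(name)
    return index


index = build_index(["building", "bridge", "hello"])
print(sorted(index["b"]))    # ['bridge', 'building']
print(sorted(index["bui"]))  # ['building']
print(sorted(index["ell"]))  # ['hello'], a match in the middle of a word
```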

score 3 · Accepted Answer

Same as @Desmond Lua's answer, but with a different tokenize function:

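def tokenize(word):
  token = []
  words = word.split(' ')
  for word in words:
    for i in range(len(word)):
      if i == 0: continue  # skip single-character prefixes
      w = word[i]
      if i == 1:
        # first token of each word: its first two characters
        token += [word[0] + w]
        continue

      # extend the previous prefix by one character
      token += [token[-1:][0] + w]

  return ",".join(token)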

It will parse hello world as he,hel,hell,hello,wo,wor,worl,world.

It works well for lightweight autocomplete purposes.
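
As a rough sketch, the same prefix-only output can also be produced with slicing. `prefix_tokens` and its `min_length` parameter are hypothetical names, not part of the answer. Prefix tokens grow linearly with word length instead of quadratically, at the cost of no longer matching text in the middle of a word.

```python
def prefix_tokens(phrase, min_length=2):
    """Return the comma-joined prefixes (min_length chars and up) of each word."""
    tokens = []
    for word in phrase.split():
        for end in range(min_length, len(word) + 1):
            tokens.append(word[:end])
    return ",".join(tokens)


print(prefix_tokens("hello world"))
# he,hel,hell,hello,wo,wor,worl,world
```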

score 2 · Accepted Answer

As described in "Full Text Search and LIKE statement", no, this is not possible, because the Search API implements full-text indexing.

Hope this helps!

score 0 · Accepted Answer

I had the same problem with a typeahead control, and my solution was to parse the string into small parts:

name = 'hello world'
name_search = ' '.join([name[:i] for i in range(2, len(name) + 1)])
print(name_search)
# -> he hel hell hello hello  hello w hello wo hello wor hello worl hello world

Hope this helps.
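
One way to read this approach: the prefixes are taken from the whole phrase, but because they are joined with spaces, the field value breaks down into per-word prefixes once it is split on whitespace. The sketch below only shows that breakdown in plain Python; whether the Search API tokenizes the field in exactly this way is an assumption here, not something the answer states.

```python
name = 'hello world'
name_search = ' '.join(name[:i] for i in range(2, len(name) + 1))

# Distinct whitespace-separated words in the generated field value.
words = sorted(set(name_search.split()))
print(words)
# ['he', 'hel', 'hell', 'hello', 'w', 'wo', 'wor', 'worl', 'world']
```

Note that only the first word skips its one-character prefix; later words such as "world" also contribute "w".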

score 0 · Accepted Answer

My optimized version: no duplicate tokens

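def tokenization(text):
    tokens = []
    min_length = 3  # shortest prefix worth indexing (avoids shadowing built-in min)
    for word in text.split():
        if len(word) >= min_length:
            # Include every prefix up to and including the full word,
            # so that searching for the complete word also matches.
            for i in range(min_length, len(word) + 1):
                token = word[:i]
                if token not in tokens:
                    tokens.append(token)
    return tokens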
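
For longer texts, checking membership against a list rescans it on every token. A possible variant, sketched under the assumption that token order should be preserved, keeps a set next to the list. `unique_prefixes` is a hypothetical name; it emits prefixes from `min_length` characters up to and including the whole word.

```python
def unique_prefixes(text, min_length=3):
    """Return word prefixes of at least min_length chars, in order, without duplicates."""
    seen = set()
    tokens = []
    for word in text.split():
        for end in range(min_length, len(word) + 1):
            token = word[:end]
            if token not in seen:  # constant-time membership test on the set
                seen.add(token)
                tokens.append(token)
    return tokens


print(unique_prefixes("build builder"))
# ['bui', 'buil', 'build', 'builde', 'builder']
```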

score 0 · Accepted Answer

Jumping in very late here.

But here is my well-documented tokenize function. The docstring should help you understand it and use it. Good luck!!!

def tokenize(string_to_tokenize, token_min_length=2):
  """Tokenizes a given string.

  Note: If a word in the string to tokenize is no longer than
  the minimum length of the token, then the word is added to the set
  of tokens and skipped from further processing.
  Avoids duplicate tokens by using a set to save the tokens.
  Example usage:
    tokens = tokenize('pack my box', 3)

  Args:
    string_to_tokenize: str, the string we need to tokenize.
      Example: 'pack my box'.
    token_min_length: int, the minimum length we want for a token.
      Example: 3.

  Returns:
    set, containing the tokenized strings. Example: set(['box', 'pac', 'my',
    'pack'])
  """
  tokens = set()
  token_min_length = token_min_length or 1
  for word in string_to_tokenize.split():  # split() skips repeated whitespace, so no empty tokens
    if len(word) <= token_min_length:
      tokens.add(word)
    else:
      for i in range(token_min_length, len(word) + 1):
        tokens.add(word[:i])
  return tokens
