json - Find exactly two words anywhere in a sentence (array of sentences), Elasticsearch 6.8

Question

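I'm trying to write a custom raw query for Elasticsearch in which I need to search for a combination of IDs inside a string that contains multiple space-separated IDs.

The searched field looks like this:

Document 1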

"sentence": [
             "1060 1764 1769 1770 1772 2807 2808 3570", 
             "1101 3402 3403",
             "1101 1764 1769 1770 1772",
             "1001 1060 1099 1100 1101 2806 2807 2808 3570"
            ]

Document 2

"sentence": [
             "1060 2806 2807 2808 3570", 
             "1101 3402 3403",
             "1101 1764 1769 1770 1772",
             "1001 1060 1488 1489 1490 2806 2807 2808 3570"
            ]

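For example, when searching with the parameters "1060 and 1101", it should return only document 1, because that document contains both values within a single string. I would like to avoid nested queries if possible.

I have tried a bool must match query, match_phrase, query_string, simple_query_string, a bool must filter with term queries, and regexp combinations. All of them returned something, but not exactly what I need.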

score 0 · Accepted Answer

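The root of your problem is a misunderstanding of how arrays work in Elasticsearch. From the documentation:

There is no dedicated array data type.

This means that when you index an array (one whose type is not nested), you lose the ability to query its items individually, because Elasticsearch "flattens" the array.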
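To make the flattening concrete, here is a minimal Python sketch. It is a simplified model and not Elasticsearch code: it assumes whitespace tokenization and treats the whole array as one bag of tokens, which is enough to show why a bool/must over both IDs also hits document 2. The helper names `flattened_tokens` and `matches_all` are made up for this illustration.

```python
# Simplified model (assumption): a non-nested array of strings behaves like
# one flat set of tokens, so the boundaries between elements are not visible
# to a plain term or match query.
doc1 = [
    "1060 1764 1769 1770 1772 2807 2808 3570",
    "1101 3402 3403",
    "1101 1764 1769 1770 1772",
    "1001 1060 1099 1100 1101 2806 2807 2808 3570",
]
doc2 = [
    "1060 2806 2807 2808 3570",
    "1101 3402 3403",
    "1101 1764 1769 1770 1772",
    "1001 1060 1488 1489 1490 2806 2807 2808 3570",
]


def flattened_tokens(values):
    """Whitespace-split every array element into a single token set."""
    return {token for value in values for token in value.split()}


def matches_all(values, terms):
    """Mimic a bool/must of single-term matches against the flattened field."""
    return set(terms) <= flattened_tokens(values)


print(matches_all(doc1, ["1060", "1101"]))  # True
print(matches_all(doc2, ["1060", "1101"]))  # True, although no single string holds both IDs
```
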

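You have two options:

The more appropriate solution is to reindex your data with sentence mapped as type nested; you can then query each item individually.

The new mapping would look like this: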

            {
                "mappings": {
                    "doc": {
                        "properties": {
                            "sentence": {
                                "type": "nested",
                                "properties": {
                                    "value": {
                                        "type": "text"
                                    }
                                }
                            }
                        }
                    }
                }
            }

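However, since I'm not familiar with your product and requirements, this solution may not suit you, as it could affect many other queries you already use.

Hence option 2: use a script to filter out documents:

(This script is a quick example I put together; you can write a more efficient version to optimize run time. Assuming many documents contain none of the terms you query for, it is worthwhile to add a query (similar to what you have been doing) so that the filter only iterates over "suspect" matches.)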

The must clause is optional and only there to pre-filter documents that are not relevant; test on your data whether it is needed. The triple-quoted script string is Kibana Console syntax.

{
    "query": {
        "bool": {
            "must": {"query_string": {"default_field": "sentence", "query": "1060 AND 1101"}},
            "filter": {
                "script": {
                    "script": {
                        "lang": "painless",
                        "source": """
                            String[] queries = new String[] {'1060', '1101'};
                            for (int i = 0; i < doc['sentence.keyword'].length; i++) {
                                // pad with spaces so only whole IDs match ('1060' must not hit '11060')
                                String padded = ' ' + doc['sentence.keyword'][i] + ' ';
                                int count = 0;
                                for (int j = 0; j < queries.length; j++) {
                                    if (padded.indexOf(' ' + queries[j] + ' ') > -1) {
                                        count += 1;
                                    }
                                }
                                // every queried ID was found inside this single string
                                if (count == queries.length) {
                                    return true;
                                }
                            }
                            return false;
                        """
                    }
                }
            }
        }
    }
}

As I said, option 2 is the less "proper" solution and far less efficient, but it is a valid one if you need it.
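The per-string check that such a script filter performs can be sketched outside Elasticsearch. The Python below is only an illustration of the logic, assuming the IDs are separated by whitespace; the function names are hypothetical. It also shows why a plain substring test is risky when IDs can share digits.

```python
def sentence_contains_all(sentence, terms):
    """True if every term appears as a whole whitespace-separated token."""
    tokens = set(sentence.split())
    return all(term in tokens for term in terms)


def doc_matches(sentences, terms):
    """True if at least one single string contains all the terms."""
    return any(sentence_contains_all(s, terms) for s in sentences)


doc1 = [
    "1060 1764 1769 1770 1772 2807 2808 3570",
    "1101 3402 3403",
    "1101 1764 1769 1770 1772",
    "1001 1060 1099 1100 1101 2806 2807 2808 3570",
]
doc2 = [
    "1060 2806 2807 2808 3570",
    "1101 3402 3403",
    "1101 1764 1769 1770 1772",
    "1001 1060 1488 1489 1490 2806 2807 2808 3570",
]

print(doc_matches(doc1, ["1060", "1101"]))  # True (fourth string)
print(doc_matches(doc2, ["1060", "1101"]))  # False

# Substring pitfall: '1060' is found inside '11060' by a substring test,
# but not by a whole-token comparison.
print("1060" in "11060 1101")                                # True
print(sentence_contains_all("11060 1101", ["1060", "1101"]))  # False
```
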

score 0 · Accepted Answer

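1) Reindex your data using this mapping for your field. Each element of the nested field (the array) is one sentence. Since you only use lists of numbers, I would store them as strings. A possible improvement is a custom analyzer to control how they are indexed (not mandatory as long as you keep using simple integers).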

"sentence": {
    "type": "nested",
    "properties": {
        "sentencearray": {
            "type": "text"
        }
    }
}

2) Query with a nested query:

{
  "query": {
    "nested": {
      "path": "sentence",
      "query": {
        "bool": {
          "must": [
            { "match": { "sentence.sentencearray": "1060" }},
            { "match": { "sentence.sentencearray":  "1101" }} 
          ]
        }
      }
    }
  }
}

3) To filter the results and keep only the matching nested elements, add inner_hits to the query:

{
  "query": {
    "nested": {
      "path": "yourfield",
      "query": {
        "bool": {
          "must": [
            { "match": { "yourfield.yourarray": "1012" }},
            { "match": { "yourfield.yourarray":  "1024" }} 
          ]
        }
      },
      "inner_hits":{}
    }
  }
}
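The three steps above can be sketched in Python to show what changes for the data. This is a rough model and not a client for Elasticsearch: `to_nested` reshapes the original string array into objects with the `sentencearray` field from step 1, and `nested_inner_hits` imitates a nested bool/must evaluated per element, returning the matching elements with their offsets in the spirit of inner_hits. Both function names are hypothetical, and whitespace tokenization is assumed.

```python
def to_nested(values, inner_field="sentencearray"):
    """Reshape ["a b", "c d"] into [{"sentencearray": "a b"}, ...] for reindexing."""
    return [{inner_field: value} for value in values]


def nested_inner_hits(nested, terms, inner_field="sentencearray"):
    """Return (offset, text) for every nested element that contains all terms."""
    hits = []
    for offset, obj in enumerate(nested):
        tokens = set(obj[inner_field].split())
        if all(term in tokens for term in terms):
            hits.append((offset, obj[inner_field]))
    return hits


doc1 = to_nested([
    "1060 1764 1769 1770 1772 2807 2808 3570",
    "1101 3402 3403",
    "1101 1764 1769 1770 1772",
    "1001 1060 1099 1100 1101 2806 2807 2808 3570",
])
doc2 = to_nested([
    "1060 2806 2807 2808 3570",
    "1101 3402 3403",
    "1101 1764 1769 1770 1772",
    "1001 1060 1488 1489 1490 2806 2807 2808 3570",
])

print(nested_inner_hits(doc1, ["1060", "1101"]))
# [(3, '1001 1060 1099 1100 1101 2806 2807 2808 3570')]
print(nested_inner_hits(doc2, ["1060", "1101"]))
# []
```

Because each element is evaluated on its own, document 2 no longer matches, which is the behaviour the question asks for.
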

score 0 · Accepted Answer

You don't need nested arrays. Here is a working example with a phrase match (and one big catch!)

# no special mapping needed
PUT stackoverflow-58283078

POST stackoverflow-58283078/_doc
{
  "sentence": [
             "1060 1764 1769 1770 1772 2807 2808 3570", 
             "1101 3402 3403",
             "1101 1764 1769 1770 1772",
             "1001 1060 1099 1100 1101 2806 2807 2808 3570"
            ]
}


POST stackoverflow-58283078/_doc
{
  "sentence": [
    "1060 2806 2807 2808 3570",
    "1101 3402 3403",
    "1101 1764 1769 1770 1772",
    "1001 1060 1488 1489 1490 2806 2807 2808 3570"
  ]
}

POST stackoverflow-58283078/_search
{
  "query": {
    "match_phrase": {
      "sentence": {
        "query":  "1060 1101",
        "slop": 20
      }
    }
  }
}

This query returns:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.16809675,
    "hits" : [
      {
        "_index" : "stackoverflow-58283078",
        "_type" : "_doc",
        "_id" : "4bsgrG0BWf1JU_OTT9FV",
        "_score" : 0.16809675,
        "_source" : {
          "sentence" : [
            "1060 1764 1769 1770 1772 2807 2808 3570",
            "1101 3402 3403",
            "1101 1764 1769 1770 1772",
            "1001 1060 1099 1100 1101 2806 2807 2808 3570"
          ]
        }
      }
    ]
  }
}

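Why? Because a phrase match looks for the tokens within a "slop" radius of each other. Since the default position_increment_gap is 100, it does not match across different values.

Do you have a maximum number of tokens per sentence? For example, if you want to handle 5000 tokens, you can configure a slop of 5001 (which allows for the first and last tokens being in reverse order, trust me on this :p) and a position_increment_gap greater than 5002.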
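The interplay of slop and position_increment_gap can be illustrated with a small Python model. It is a sketch under stated assumptions, not Lucene's actual scorer: tokens are split on whitespace, every token advances the position by 1, the gap is added between array values, and a two-term phrase needs a slop of d - 1 when the terms appear in query order d positions apart, or d + 1 when they appear reversed. All function names are hypothetical.

```python
def token_positions(values, gap=100):
    """Map each token to its positions, adding `gap` between array values."""
    positions = {}
    pos = -1
    for i, value in enumerate(values):
        if i > 0:
            pos += gap  # modelled position_increment_gap between two values
        for token in value.split():
            pos += 1
            positions.setdefault(token, []).append(pos)
    return positions


def required_slop(pos_first, pos_second):
    """Slop needed for the two-term phrase "first second" (simplified model)."""
    d = pos_second - pos_first
    return d - 1 if d > 0 else -d + 1  # reversed order costs two extra moves


def phrase_matches(values, first, second, slop, gap=100):
    positions = token_positions(values, gap)
    return any(
        required_slop(a, b) <= slop
        for a in positions.get(first, [])
        for b in positions.get(second, [])
    )


doc1 = [
    "1060 1764 1769 1770 1772 2807 2808 3570",
    "1101 3402 3403",
    "1101 1764 1769 1770 1772",
    "1001 1060 1099 1100 1101 2806 2807 2808 3570",
]
doc2 = [
    "1060 2806 2807 2808 3570",
    "1101 3402 3403",
    "1101 1764 1769 1770 1772",
    "1001 1060 1488 1489 1490 2806 2807 2808 3570",
]

print(phrase_matches(doc1, "1060", "1101", slop=20))   # True
print(phrase_matches(doc2, "1060", "1101", slop=20))   # False
print(phrase_matches(doc2, "1060", "1101", slop=200))  # True: slop now exceeds the gap
```

The last line shows the catch: once the slop is larger than the gap, the phrase can match across two different array values, so the gap has to be raised together with the slop.
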
