elasticsearch - Elasticsearch: How to store term vectors

Question

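I'm working on a project in which I use Elasticsearch heavily and rely on the moreLikeThis query for some features. The official documentation for the MLT query states the following:

In order to speed up analysis, it could help to store term vectors at index time, but at the expense of disk usage.

This appears in the "How it Works" section. The idea now is to adjust the mapping so that pre-computed term vectors are stored. The problem is that the documentation does not make it clear how this should be done. On the one hand, the MLT docs provide an example mapping that looks like this: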

curl -s -XPUT 'http://localhost:9200/imdb/' -d '{
  "mappings": {
    "movies": {
      "properties": {
        "title": {
          "type": "string",
          "term_vector": "yes"
        },
        "description": {
          "type": "string"
        },
        "tags": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed",
              "term_vector": "yes"
            }
          }
        }
      }
    }
  }
}'

On the other hand, the term vectors documentation provides, in its "Example 1" section, a mapping that looks like this:

curl -s -XPUT 'http://localhost:9200/twitter/' -d '{
  "mappings": {
    "tweet": {
      "properties": {
        "text": {
          "type": "string",
          "term_vector": "with_positions_offsets_payloads",
          "store" : true,
          "index_analyzer" : "fulltext_analyzer"
        },
        "fullname": {
          "type": "string",
          "term_vector": "with_positions_offsets_payloads",
          "index_analyzer" : "fulltext_analyzer"
        }
      }
    }
    ....

This is supposed to create an index that stores term vectors, payloads etc.

So the question is: which mapping should be used? Is this a flaw in the documentation, or am I missing something?

score 10 · Accepted Answer

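You are right, this does not seem to be spelled out in the current version of the documentation, but it is explained in more detail in the documentation for the upcoming 2.0 release.

Term vectors contain information about the terms produced by the analysis process, including:

a list of terms.

the position (or order) of each term.

the start and end character offsets mapping the term to its origin in the original string.

These term vectors can be stored so that they can be retrieved for a particular document.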
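To illustrate what retrieving stored term vectors for a particular document might look like, here is a minimal Python sketch that only builds the HTTP request and does not send it. The index, type and field names are taken from the question's twitter example. The document id "1", the helper name, and the endpoint name `_termvectors` are assumptions, so check the term vectors API documentation for your Elasticsearch version before relying on them.

```python
import json
from urllib.request import Request


def term_vectors_request(base_url, index, doc_type, doc_id, fields,
                         endpoint="_termvectors"):
    """Build (but do not send) a request for one document's term vectors.

    `endpoint` is an assumption: the name of the term vectors endpoint
    depends on the Elasticsearch version in use.
    """
    url = "{}/{}/{}/{}/{}".format(base_url, index, doc_type, doc_id, endpoint)
    body = {
        "fields": fields,     # which fields to return term vectors for
        "positions": True,    # include term positions, if they were stored
        "offsets": True,      # include character offsets, if they were stored
    }
    return Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical document id "1" in the question's twitter/tweet index.
    req = term_vectors_request("http://localhost:9200", "twitter", "tweet",
                               "1", ["text"])
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Passing the returned object to `urllib.request.urlopen` would actually send it to a running node.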

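The term_vector setting accepts:

no: No term vectors are stored. (default)

yes: Just the terms in the field are stored.

with_positions: Terms and positions are stored.

with_offsets: Terms and character offsets are stored.

with_positions_offsets: Terms, positions, and character offsets are stored.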
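As a rough sketch of how the setting fits into a mapping, the Python snippet below builds a request body shaped like the MLT example from the question, with the term_vector value as a parameter. The helper name `build_mapping` and the constant `LISTED_OPTIONS` are hypothetical. The validation covers only the values listed above. The question's second example uses `with_positions_offsets_payloads`, which is not in that list and is therefore rejected by this sketch.

```python
import json

# Values for term_vector as listed in the answer above.
LISTED_OPTIONS = (
    "no",
    "yes",
    "with_positions",
    "with_offsets",
    "with_positions_offsets",
)


def build_mapping(doc_type, field, term_vector="yes"):
    """Return a create-index body that sets term_vector on one string field."""
    if term_vector not in LISTED_OPTIONS:
        raise ValueError("unlisted term_vector value: {!r}".format(term_vector))
    return {
        "mappings": {
            doc_type: {
                "properties": {
                    field: {
                        "type": "string",  # field type used in the question
                        "term_vector": term_vector,
                    }
                }
            }
        }
    }


if __name__ == "__main__":
    # Same index layout as the imdb example in the question.
    print(json.dumps(build_mapping("movies", "title", "yes"), indent=2))
```

The printed JSON could be passed as the -d body of a curl -XPUT call like the ones in the question. Since the MLT example itself uses "yes", that value is presumably sufficient when only the terms are needed.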
