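I am currently working with Elasticsearch and have an index with a large number of documents (about 500K). I want to store the n-grams of each document's text data in another index. The text is also large: each document contains roughly 2 pages of text. To do this, I compute the term vectors and their counts for each document and store them in the new index, so that I can run aggregation queries against that new index.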
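To make the goal concrete, here is a minimal sketch of how an `_mtermvectors` response could be flattened into one record per (document, n-gram) pair before indexing into the new index. It assumes the standard response shape (`docs` → `term_vectors` → field → `terms` → `term_freq`). The function name `term_count_docs` and the record keys `source_id`, `term`, and `term_freq` are made up for illustration and are not part of my actual code.

```python
from typing import Any, Dict, Iterator


def term_count_docs(
    response: Dict[str, Any], field: str = "plain_text"
) -> Iterator[Dict[str, Any]]:
    """Yield one flat record per (document, term) from an _mtermvectors response.

    The record layout is a hypothetical choice made for this sketch.
    """
    for doc in response.get("docs", []):
        # Entries without term vectors for the field (for example, ids that
        # were not found) produce no records.
        terms = doc.get("term_vectors", {}).get(field, {}).get("terms", {})
        for term, stats in terms.items():
            yield {
                "source_id": doc["_id"],
                "term": term,
                "term_freq": stats["term_freq"],
            }
```

Each yielded record could then be sent to the new index (for example through the bulk helper), where a terms aggregation on `term` with a sum of `term_freq` would give corpus-wide n-gram counts.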

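The settings of the old index allow me to use the termvectors and mtermvectors APIs. I don't want to send too many requests to the Elasticsearch server in a short time, so I am using the mtermvectors Python API. I tried to fetch the term vectors of 25 documents in a single call by passing their 25 ids.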

Example HTTP URL produced by calling the mtermvectors API from Python:

http://*servername*/elastic/*indexname*/article/_mtermvectors?offsets=false&fields=plain_text&ids=608467%2C608469%2C608473%2C608475%2C608477%2C608482%2C608485%2C608492%2C608498%2C608504%2C608509%2C608511%2C608520%2C608522%2C608528%2C608530%2C608541%2C608549%2C608562%2C608570%2C608573%2C608576%2C608577%2C608579%2C608585&field_statistics=true&term_statistics=true&payloads=false&positions=false
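For reference, the batched call that produces a URL like the one above can be sketched roughly as follows. This is a simplified illustration, not my exact code: `chunked` and `fetch_term_vectors` are hypothetical helper names, and `client` is assumed to be an `elasticsearch.Elasticsearch` instance whose `mtermvectors` method accepts the keyword arguments shown. On older clusters that still use mapping types, the `article` type would presumably also have to be passed.

```python
from typing import Any, Dict, Iterator, List, Sequence


def chunked(ids: Sequence[str], size: int = 25) -> Iterator[List[str]]:
    """Split the id list into consecutive batches of at most `size` ids."""
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def fetch_term_vectors(
    client: Any, index: str, ids: Sequence[str], batch_size: int = 25
) -> Iterator[Dict[str, Any]]:
    """Issue one mtermvectors request per batch and yield each raw response.

    `client` is assumed to be an Elasticsearch Python client instance.
    """
    for batch in chunked(ids, batch_size):
        yield client.mtermvectors(
            index=index,
            ids=batch,
            fields=["plain_text"],
            term_statistics=True,
            field_statistics=True,
            offsets=False,
            positions=False,
            payloads=False,
        )
```

With 500K documents and 25 ids per call, this amounts to about 20,000 requests in total.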

Sometimes I get the expected response, but most of the time I get the following error:

Proxy Error
The proxy server received an invalid response from an upstream server.
The proxy server could not handle the request GET /elastic/*indexname*/article/_mtermvectors.

Reason: Error reading from remote server

Index settings and mappings

{
  "settings": {
    "analysis": {
      "analyzer": {
        "shingleAnalyzer": {
          "tokenizer": "letter_tokenizer",
          "filter": [
            "lowercase",
            "custom_stop",
            "custom_shingle",
            "custom_stemmer",
            "length_filter"
          ]
        }
      },
      "filter": {
        "custom_stemmer": {
          "type": "stemmer",
          "name": "english"
        },
        "custom_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "custom_shingle": {
          "type": "shingle",
          "min_shingle_size": "2",
          "max_shingle_size": "4",
          "filler_token":""
        },
        "length_filter": {
          "type": "length",
          "min": 2
        }
      },
      "tokenizer": {
        "letter_tokenizer": {
          "type": "letter"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "article_id": {
        "type": "text"
      },
      "plain_text": {
        "term_vector": "with_positions_offsets_payloads",
        "store": true,
        "analyzer": "shingleAnalyzer",
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}

I don't think there is anything wrong with these settings and mappings, since I do sometimes get the expected response.

Let me know if you need more information from me. Any help would be greatly appreciated.
