I am currently working with Elasticsearch and have a large number of documents (about 500K) in an index. I want to store the n-grams of each document's text in another index (which will also be large, since each document contains roughly 2 pages of text). To do that, I compute the term vectors and their counts for each document and store them in the new index, so that I can run aggregation queries against it.
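For illustration, the kind of aggregation I have in mind on the new index would look something like this. This is only a minimal sketch: the index name ngram_index and the fields ngram/count are hypothetical placeholders, not my actual schema.

from elasticsearch import Elasticsearch

# "servername", "ngram_index", "ngram" and "count" are placeholders.
# "ngram" is assumed to be a keyword field so a terms aggregation works on it.
es = Elasticsearch(["http://servername/elastic"])

# Sum the stored counts per n-gram across all documents.
agg_query = {
    "size": 0,
    "aggs": {
        "top_ngrams": {
            "terms": {"field": "ngram", "size": 100},
            "aggs": {"total_count": {"sum": {"field": "count"}}},
        }
    },
}
result = es.search(index="ngram_index", body=agg_query)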
The settings of the old index allow me to use the termvectors and mtermvectors APIs. Since I don't want to send too many requests to the Elasticsearch server in a short time, I am using the mtermvectors Python API and trying to fetch the term vectors for 25 documents at a time by passing 25 document ids; a sketch of one such call is shown below.
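Roughly, each batched call looks like this (a sketch assuming the elasticsearch-py client; host and index names are placeholders, and the parameters mirror the ones in the URL below):

from elasticsearch import Elasticsearch

# Placeholder host; the real cluster sits behind a proxy at /elastic.
es = Elasticsearch(["http://servername/elastic"])

batch = ["608467", "608469", "608473"]  # one batch of 25 document ids

response = es.mtermvectors(
    index="indexname",
    doc_type="article",
    ids=",".join(batch),       # comma-separated list of document ids
    fields="plain_text",
    term_statistics=True,
    field_statistics=True,
    offsets=False,
    positions=False,
    payloads=False,
)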
Example of the HTTP URL produced after calling the mtermvectors API from Python:
http://*servername*/elastic/*indexname*/article/_mtermvectors?offsets=false&fields=plain_text&ids=608467%2C608469%2C608473%2C608475%2C608477%2C608482%2C608485%2C608492%2C608498%2C608504%2C608509%2C608511%2C608520%2C608522%2C608528%2C608530%2C608541%2C608549%2C608562%2C608570%2C608573%2C608576%2C608577%2C608579%2C608585&field_statistics=true&term_statistics=true&payloads=false&positions=false
Sometimes I get the expected response, but most of the time I get the following error:
Proxy Error
The proxy server received an invalid response from an upstream server.
The proxy server could not handle the request GET /elastic/*indexname*/article/_mtermvectors.
Reason: Error reading from remote server
Index settings and mapping:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "shingleAnalyzer": {
          "tokenizer": "letter_tokenizer",
          "filter": [
            "lowercase",
            "custom_stop",
            "custom_shingle",
            "custom_stemmer",
            "length_filter"
          ]
        }
      },
      "filter": {
        "custom_stemmer": {
          "type": "stemmer",
          "name": "english"
        },
        "custom_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "custom_shingle": {
          "type": "shingle",
          "min_shingle_size": "2",
          "max_shingle_size": "4",
          "filler_token": ""
        },
        "length_filter": {
          "type": "length",
          "min": 2
        }
      },
      "tokenizer": {
        "letter_tokenizer": {
          "type": "letter"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "article_id": {"type": "text"},
      "plain_text": {
        "term_vector": "with_positions_offsets_payloads",
        "store": true,
        "analyzer": "shingleAnalyzer",
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}
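For reference, the analyzer itself can be sanity-checked through the _analyze API (a minimal sketch, again with placeholder host and index names):

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://servername/elastic"])

# Run a sample string through the custom analyzer and print the shingles.
sample = es.indices.analyze(
    index="indexname",
    body={"analyzer": "shingleAnalyzer", "text": "Some sample article text"},
)
for token in sample["tokens"]:
    print(token["token"])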
I don't think there is anything wrong with these settings or the mapping, since I do sometimes get the expected response.
Please let me know if you need more information from my side. Any help would be greatly appreciated.