elasticsearch - Elasticsearch significant terms aggregation

Question

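I have started using the significant terms aggregation to see which keywords are significant in a group of documents compared with the entire set of documents I have indexed.

It worked well until a lot of documents had been indexed. Now, for the same query that used to work, elasticsearch only says: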

 SearchPhaseExecutionException[Failed to execute phase [query], 
 all shards failed; shardFailures {[OIWBSjVzT1uxfxwizhS5eg][demo_paragraphs][0]:
 CircuitBreakingException[Data too large, data for field [text] 
 would be larger than limit of [633785548/604.4mb]];

My query looks like this:

 POST /demo_paragraphs/_search
 {
     "query": {
         "match": {
            "django_target_id": 1915661
         }
     },
     "aggregations" : {
         "signKeywords" : {
             "significant_terms" : {
                 "field" : "text"
             }
         }
     }
 }

And the document structure:

        "_source": {
           "django_ct": "citations.citation",
           "django_target_id": 1915661,
           "django_id": 3414077,
           "internal_citation_id": "CR7_151",
           "django_source_id": 1915654,
           "text": "Mucin 1 (MUC1) is a protein heterodimer that is overexpressed in lung cancers [6]. MUC1 consists of two subunits, an N-terminal extracellular subunit (MUC1-N) and a C-terminal transmembrane subunit (MUC1-C). Overexpression of MUC1 is sufficient for the induction of anchorage independent growth and tumorigenicity [7]. Other studies have shown that the MUC1-C cytoplasmic domain is responsible for the induction of the malignant phenotype and that MUC1-N is dispensable for transformation [8]. Overexpression of",
           "id": "citations.citation.3414077",
           "num_distinct_citations": 0
        }

The data I index are paragraphs from scientific papers. No single document is really large.

Any ideas on how to analyze or solve the problem?

score 2 · Accepted Answer

If the dataset is too large to compute the result on one machine, you may need more than one node.

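Be thoughtful when planning shard distribution. Make sure the shards are evenly distributed so that every node is stressed equally when computing heavy queries. A good topology for large datasets is a master-data-search configuration, in which one node acts as the master (no data, no queries running on that node), a few nodes are dedicated to holding data (shards), and a few nodes are dedicated to executing queries (they hold no data; they use the data nodes for partial query execution and combine the results). For one, Netflix uses this topology; see Netflix Raigad.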
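As a rough sketch of that master-data-search split, the snippet below prints the per-role lines one might put in `elasticsearch.yml`. It assumes the legacy `node.master` / `node.data` settings from the 1.x era of this question (later releases replaced these keys with a `node.roles` list), and the role names `master`, `data` and `search` are just labels I made up for illustration.

```python
# Sketch only: maps the three roles described above to the legacy
# node.master / node.data settings (assumption: a 1.x-style elasticsearch.yml).
# The role names are hypothetical labels, not Elasticsearch identifiers.
ROLE_SETTINGS = {
    "master": {"node.master": True, "node.data": False},   # coordinates the cluster, holds no shards
    "data": {"node.master": False, "node.data": True},     # holds shards, does the per-shard work
    "search": {"node.master": False, "node.data": False},  # receives queries, merges partial results
}


def render_yaml(role):
    """Return the elasticsearch.yml lines for one of the roles above."""
    settings = ROLE_SETTINGS[role]
    return "\n".join(f"{key}: {str(value).lower()}" for key, value in settings.items())


if __name__ == "__main__":
    for role in ROLE_SETTINGS:
        print(f"# {role} node")
        print(render_yaml(role))
        print()
```

Queries would then be sent to the search nodes only, so the data nodes spend their heap on field data rather than on merging results.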

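Paweł Róg is right, you will need much more RAM. For a start, increase the Java heap size available to each node. See this site for details: ElasticSearch configuration. You will have to research how much RAM is enough. Sometimes too much RAM actually slows ES down (unless that has been fixed in one of the recent versions).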
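To get a feel for how much heap the node currently has, you can work backwards from the limit in the error message. The sketch below assumes the field data circuit breaker is set to 60% of the JVM heap; that fraction is my assumption, so check the breaker setting of your own version before trusting the result.

```python
# Sketch: estimate the JVM heap from the circuit breaker limit in the error.
# LIMIT_BYTES is copied from the exception in the question.
# ASSUMED_BREAKER_FRACTION is an assumption (breaker limit = 60% of heap).
LIMIT_BYTES = 633785548
ASSUMED_BREAKER_FRACTION = 0.60


def to_mb(num_bytes):
    """Convert bytes to mebibytes, the unit Elasticsearch prints as 'mb'."""
    return num_bytes / (1024 * 1024)


def implied_heap_bytes(limit_bytes, breaker_fraction):
    """Heap size that would produce this limit at the given breaker fraction."""
    return limit_bytes / breaker_fraction


if __name__ == "__main__":
    heap = implied_heap_bytes(LIMIT_BYTES, ASSUMED_BREAKER_FRACTION)
    print(f"breaker limit: {to_mb(LIMIT_BYTES):.1f} mb")
    print(f"implied heap:  {to_mb(heap):.1f} mb")
```

If the assumption holds, the implied heap comes out at roughly 1 GB, which would suggest the node is running with a small heap and has room to grow before more machines are needed.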

score 0 · Accepted Answer

I think there is a simple solution. Please give ES more RAM :D Aggregations need a lot of memory.

Answered 2014-11-25T07:55:30.977

score 0 · Accepted Answer

Note that elasticsearch 6.0 introduced a new aggregation, significant_text, which does not require field data. See https://www.elastic.co/guide/en/elasticsearch/reference/master/search-aggregations-bucket-significanttext-aggregation.html
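A minimal sketch of what the query from the question might look like with `significant_text`, built as a Python dict so it can be printed and pasted as a request body. Wrapping the aggregation in a `sampler` keeps the amount of text that gets re-analyzed small; the aggregation name `sample` and the `shard_size` of 100 are placeholder values of mine, not something from the question.

```python
import json


def build_significant_text_query(target_id, field="text", shard_size=100):
    """Build a search body that runs significant_text over a sample of hits.

    The 'sample' aggregation name and the default shard_size are
    illustrative assumptions; 'signKeywords' is the name used in the question.
    """
    return {
        "query": {"match": {"django_target_id": target_id}},
        "aggregations": {
            "sample": {
                # Only the top-matching documents per shard are analyzed.
                "sampler": {"shard_size": shard_size},
                "aggregations": {
                    "signKeywords": {"significant_text": {"field": field}}
                },
            }
        },
    }


if __name__ == "__main__":
    body = build_significant_text_query(1915661)
    # Body for: POST /demo_paragraphs/_search
    print(json.dumps(body, indent=2))
```

The field has to be an analyzed text field, and the result buckets come back nested under the sampler aggregation rather than at the top level.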
