
Elasticsearch 5.6.*.

I'm looking for a way to implement a mechanism whereby one of my indices (which grows quickly, at roughly 1 million documents per day) automatically manages its storage constraints.

For example: I define a maximum document count or maximum index size as a variable 'n'. I would write a scheduler that checks whether 'n' has been reached. If it has, I want to delete the oldest 'x' documents (based on time).

I have a couple of questions here:

Obviously, I don't want to delete too much or too little. How do I know what 'x' is? Can I simply tell Elasticsearch "hey, delete the oldest 5 GB worth of documents"? My intent is simply to free up a fixed amount of storage. Is this possible?

Secondly, I'd like to know what the best practice is here. Obviously I don't want to reinvent the wheel; if there is anything out there (e.g. Curator, which I only heard about recently) that does the job, I'd be happy to use it.


3 Answers


The best practice in your case is to use time-based indices, whether daily, weekly, or monthly, whatever makes sense for the volume of data you have and the retention you want. You can also use the Rollover API to decide when a new index needs to be created (based on time, document count, or index size).
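As a minimal sketch, a rollover request might look like the following (this assumes an alias named logs_write that already points at the current write index, and Elasticsearch listening on localhost:9200; only time and document-count conditions are shown, since I'm not certain the size-based condition is available on 5.6):

# Roll over to a new index once the current one is a day old or holds 1M docs
curl -X POST "http://localhost:9200/logs_write/_rollover" -H 'Content-Type: application/json' -d'
{
  "conditions": {
    "max_age":  "1d",
    "max_docs": 1000000
  }
}'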

Deleting an entire index is much easier than deleting the documents within an index that match some criteria. If you do the latter, the documents are marked as deleted, but the space is not reclaimed until the underlying segments are merged. Whereas if you delete an entire time-based index, the space is guaranteed to be freed.
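For comparison, here is a sketch of the document-level approach described above (the index name "myindex" and the date field "timestamp" are assumptions): a _delete_by_query only marks documents as deleted, and the disk space comes back once the deletes are expunged from the segments.

# Delete documents older than 30 days (assumes a date field called "timestamp")
curl -X POST "http://localhost:9200/myindex/_delete_by_query" -H 'Content-Type: application/json' -d'
{
  "query": {
    "range": { "timestamp": { "lt": "now-30d" } }
  }
}'
# Space is only reclaimed when the deleted documents are purged from the segments
curl -X POST "http://localhost:9200/myindex/_forcemerge?only_expunge_deletes=true"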

Answered on 2018-06-22T12:37:52.347

I came up with a fairly simple bash script solution for cleaning up time-based indices in Elasticsearch and thought I'd share it in case anyone is interested. Curator seems to be the standard answer for doing this, but I really didn't want to install and manage a Python application and all the dependencies it requires. It doesn't get much simpler than a bash script executed via cron, and it has no dependencies outside of core Linux.

#!/bin/bash

# Make sure expected arguments were provided
if [ $# -lt 3 ]; then
    echo "Invalid number of arguments!"
    echo "This script is used to clean time based indices from Elasticsearch. The indices must have a"
    echo "trailing date in a format that can be represented by the UNIX date command such as '%Y-%m-%d'."
    echo ""
    echo "Usage: `basename $0` host_url index_prefix num_days_to_keep [date_format]"
    echo "The date_format argument is optional and defaults to '%Y-%m-%d'"
    echo "Example: `basename $0` http://localhost:9200 cflogs- 7"
    echo "Example: `basename $0` http://localhost:9200 elasticsearch_metrics- 31 %Y.%m.%d"
    exit 1
fi

elasticsearchUrl=$1
indexNamePrefix=$2
numDaysDataToKeep=$3
dateFormat=%Y-%m-%d
if [ $# -ge 4 ]; then
    dateFormat=$4
fi

# Get the curent date in a 'seconds since epoch' format
curDateInSecondsSinceEpoch=$(date +%s)
#echo "curDateInSecondsSinceEpoch=$curDateInSecondsSinceEpoch"

# Subtract numDaysDataToKeep from current epoch value to get the last day to keep
let "targetDateInSecondsSinceEpoch=$curDateInSecondsSinceEpoch - ($numDaysDataToKeep * 86400)"
#echo "targetDateInSecondsSinceEpoch=$targetDateInSecondsSinceEpoch"

while : ; do
    # Subtract one day from the target date epoch
    let "targetDateInSecondsSinceEpoch=$targetDateInSecondsSinceEpoch - 86400"
    #echo "targetDateInSecondsSinceEpoch=$targetDateInSecondsSinceEpoch"

    # Convert targetDateInSecondsSinceEpoch into a YYYY-MM-DD format
    targetDateString=$(date --date="@$targetDateInSecondsSinceEpoch" +$dateFormat)
    #echo "targetDateString=$targetDateString"

    # Format the index name using the prefix and the calculated date string
    indexName="$indexNamePrefix$targetDateString"
    #echo "indexName=$indexName"

    # First check if an index with this date pattern exists
    # Curl options:
    #  -s   Silent mode. Don't show progress meter or error messages.
    #  -w "%{http_code}\n"   Causes curl to display the HTTP status code only after a completed transfer.
    #  -I   Fetch the HTTP header only. For HEAD requests there is no body, so this keeps curl from waiting on it.
    #  -o /dev/null   Prevents the response body from being displayed. This does not apply to the -w output.
    httpCode=$(curl -o /dev/null -s -w "%{http_code}\n" -I -X HEAD "$elasticsearchUrl/$indexName")
    #echo "httpCode=$httpCode"
    if [ "$httpCode" -ne 200 ]
    then
        echo "Index $indexName does not exist. Stopping processing."
        break
    fi

    # Send the command to Elasticsearch to delete the index. Save the HTTP return code in a variable.
    httpCode=$(curl -o /dev/null -s -w "%{http_code}\n" -X DELETE "$elasticsearchUrl/$indexName")
    #echo "httpCode=$httpCode"

    if [ "$httpCode" -eq 200 ]
    then
        echo "Successfully deleted index $indexName."
    else
        echo "FAILURE! Delete command failed with return code $httpCode. Continuing processing with next day."
        continue
    fi

    # Verify the index no longer exists. Should return 404 when the index isn't found.
    httpCode=$(curl -o /dev/null -s -w "%{http_code}\n" -I -X HEAD "$elasticsearchUrl/$indexName")
    #echo "httpCode=$httpCode"
    if [ "$httpCode" -eq 200 ]
    then
        echo "FAILURE! Delete command responded successfully, but index still exists. Continuing processing with next day."
        continue
    fi

done
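As a usage sketch, the script can then be scheduled via cron; the install path and log file below are hypothetical, and the arguments match the examples in the script's usage text:

# Run every night at 01:00, keeping 7 days of cflogs- indices
0 1 * * * /usr/local/bin/clean_es_indices.sh http://localhost:9200 cflogs- 7 >> /var/log/clean_es_indices.log 2>&1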

Answered on 2019-02-15T16:32:55.793

I answered the same question at https://discuss.elastic.co/t/elasticsearch-efficiently-cleaning-up-the-indices-to-save-space/137019

If your index keeps growing, deleting documents is not the best practice. It sounds like you have time-series data. If so, what you want are time-based indices, or better yet, rollover indices.
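With time-based indices, the application simply writes to a dated index name each day (e.g. logs-2018.06.22), and retention becomes "delete indices older than N days". A sketch of an index template that would apply to every new daily index (the "logs-*" naming pattern and the settings are assumptions; on 5.x the match field is called "template", while newer versions use "index_patterns"):

# Template applied to every newly created index whose name matches "logs-*"
curl -X PUT "http://localhost:9200/_template/logs_template" -H 'Content-Type: application/json' -d'
{
  "template": "logs-*",
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}'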

5 GB is also a pretty small amount to purge, given that a single Elasticsearch shard can healthily grow to 20GB-50GB in size. Are you constrained on storage? How many nodes do you have?
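To see where the space is actually going before deciding on a purge strategy, the _cat APIs report disk usage per node and per index (a sketch, assuming Elasticsearch on localhost:9200):

# Disk used/available per node
curl "http://localhost:9200/_cat/allocation?v"
# Indices sorted by on-disk size, largest first
curl "http://localhost:9200/_cat/indices?v&s=store.size:desc"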

Answered on 2018-06-22T12:37:02.680