
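I have a GKE cluster, and I want to track the ratio of total memory requested to total memory allocatable. I was able to create a chart in Google Cloud Monitoring using the following two metrics: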

metric.type="kubernetes.io/container/memory/request_bytes" resource.type="k8s_container"

metric.type="kubernetes.io/node/memory/allocatable_bytes" resource.type="k8s_node"

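Both have crossSeriesReducer set to REDUCE_SUM to get the total for the whole cluster.

Then, when I tried to set up an alerting policy (using the Cloud Monitoring API) with the ratio of the two (see below), I got this error: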

ERROR: (gcloud.alpha.monitoring.policies.create) INVALID_ARGUMENT: The numerator and denominator must have the same resource type.

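It doesn't like that the first metric is a k8s_container and the second is a k8s_node. Is there a different metric or some workaround I can use to alert on the memory request/allocatable ratio in Google Cloud Monitoring?

Edit:

Here is the full request and response: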

$ gcloud alpha monitoring policies create --policy-from-file=policy.json
ERROR: (gcloud.alpha.monitoring.policies.create) INVALID_ARGUMENT: The numerator and denominator must have the same resource type.

$ cat policy.json
{
    "displayName": "Cluster Memory",
    "enabled": true,
    "combiner": "OR",
    "conditions": [
        {
            "displayName": "Ratio: Memory Requests / Memory Allocatable",
            "conditionThreshold": {
                 "filter": "metric.type=\"kubernetes.io/container/memory/request_bytes\" resource.type=\"k8s_container\"",
                 "aggregations": [
                    {
                        "alignmentPeriod": "60s",
                        "crossSeriesReducer": "REDUCE_SUM",
                        "groupByFields": [
                        ],
                        "perSeriesAligner": "ALIGN_MEAN"
                    }
                ],
                "denominatorFilter": "metric.type=\"kubernetes.io/node/memory/allocatable_bytes\" resource.type=\"k8s_node\"",
                "denominatorAggregations": [
                   {
                      "alignmentPeriod": "60s",
                      "crossSeriesReducer": "REDUCE_SUM",
                      "groupByFields": [
                       ],
                      "perSeriesAligner": "ALIGN_MEAN"
                    }
                ],
                "comparison": "COMPARISON_GT",
                "thresholdValue": 0.8,
                "duration": "60s",
                "trigger": {
                    "count": 1
                }
            }
        }
    ]
}
1 Answer
ERROR: (gcloud.alpha.monitoring.policies.create) INVALID_ARGUMENT: The numerator and denominator must have the same resource type.

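According to the official documentation:

groupByFields[] (parameter)

The set of fields to preserve when crossSeriesReducer is specified. The groupByFields determine how the time series are partitioned into subsets prior to applying the aggregation operation. Each subset contains time series that have the same value for each of the grouping fields. Each individual time series is a member of exactly one subset. The crossSeriesReducer is applied to each subset of time series. It is not possible to reduce across different resource types, so this field implicitly contains resource.type. Fields not specified in groupByFields are aggregated away. If groupByFields is not specified and all the time series have the same resource type, then the time series are aggregated into a single output time series. If crossSeriesReducer is not defined, this field is ignored.

-- Cloud.google.com: Monitoring: projects.alertPolicies

Please look specifically at this part:

It is not possible to reduce across different resource types, so this field implicitly contains resource.type.

The error above is shown when you try to create a policy whose numerator and denominator have different resource types.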
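To make the quoted rule more concrete, here is a minimal Python sketch of how a cross-series reduction behaves when resource.type is always part of the grouping key. It is only an illustration of the documented behaviour, not Cloud Monitoring's implementation; the `reduce_sum` helper, the dictionary layout, and the sample values are all made up for the example.

```python
from collections import defaultdict


def reduce_sum(series, group_by_fields=()):
    """Sum values per group; resource_type is always part of the group key."""
    totals = defaultdict(float)
    for s in series:
        # The implicit resource.type comes first, then any requested fields.
        key = (s["resource_type"],) + tuple(
            s["labels"].get(field) for field in group_by_fields
        )
        totals[key] += s["value"]
    return dict(totals)


# Hypothetical aligned values (in GiB) for two containers and one node.
series = [
    {"resource_type": "k8s_container", "labels": {}, "value": 2.0},
    {"resource_type": "k8s_container", "labels": {}, "value": 3.0},
    {"resource_type": "k8s_node", "labels": {}, "value": 8.0},
]

result = reduce_sum(series)
print(result)
```

Even with an empty groupByFields, the container series and the node series end up in separate groups, which is why the two metrics cannot be folded into one ratio condition.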

The metrics below have the following resource types:

  • kubernetes.io/container/memory/request_bytes - k8s_container
  • kubernetes.io/node/memory/allocatable_bytes - k8s_node
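If you want to catch this before calling the API, a small local check along these lines might help. This is a sketch under assumptions: the `METRIC_RESOURCE_TYPES` table only contains the two metrics listed above, and `metric_type` and `ratio_is_allowed` are hypothetical helper names, not part of gcloud or any Google client library.

```python
import re

# Metric type -> monitored resource type, for the two metrics listed above only.
METRIC_RESOURCE_TYPES = {
    "kubernetes.io/container/memory/request_bytes": "k8s_container",
    "kubernetes.io/node/memory/allocatable_bytes": "k8s_node",
}


def metric_type(filter_str):
    """Extract the metric.type value from a Monitoring filter string."""
    match = re.search(r'metric\.type\s*=\s*"([^"]+)"', filter_str)
    if match is None:
        raise ValueError("filter has no metric.type clause")
    return match.group(1)


def ratio_is_allowed(numerator_filter, denominator_filter):
    """Return True only if both filters map to the same resource type."""
    numerator_resource = METRIC_RESOURCE_TYPES[metric_type(numerator_filter)]
    denominator_resource = METRIC_RESOURCE_TYPES[metric_type(denominator_filter)]
    return numerator_resource == denominator_resource


numerator = (
    'metric.type="kubernetes.io/container/memory/request_bytes" '
    'resource.type="k8s_container"'
)
denominator = (
    'metric.type="kubernetes.io/node/memory/allocatable_bytes" '
    'resource.type="k8s_node"'
)

print(ratio_is_allowed(numerator, denominator))
```

For the two filters from the question this prints False, which matches the INVALID_ARGUMENT response.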

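You can check the resource type by looking at these metrics in GCP Monitoring:

Container

Node

As a workaround, you can try to create an alerting policy that alerts you when the allocatable memory utilization is above 85%. It will indirectly tell you that the requested memory is high enough to trigger the alert.

An example in YAML: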

combiner: OR
conditions:
- conditionThreshold:
    aggregations:
    - alignmentPeriod: 60s
      crossSeriesReducer: REDUCE_SUM
      groupByFields:
      - resource.label.cluster_name
      perSeriesAligner: ALIGN_MEAN
    comparison: COMPARISON_GT
    duration: 60s
    filter: metric.type="kubernetes.io/node/memory/allocatable_utilization" resource.type="k8s_node"
      resource.label."cluster_name"="GKE-CLUSTER-NAME"
    thresholdValue: 0.85
    trigger:
      count: 1
  displayName: Memory allocatable utilization for GKE-CLUSTER-NAME by label.cluster_name
    [SUM]
  name: projects/XX-YY-ZZ/alertPolicies/AAA/conditions/BBB
creationRecord:
  mutateTime: '2020-03-31T08:29:21.443831070Z'
  mutatedBy: XXX@YYY.com
displayName: alerting-policy-when-allocatable-memory-is-above-85
enabled: true
mutationRecord:
  mutateTime: '2020-03-31T08:29:21.443831070Z'
  mutatedBy: XXX@YYY.com
name: projects/XX-YY-ZZ/alertPolicies/
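The YAML above looks like an exported policy, so it carries fields such as name, creationRecord and mutationRecord. As far as I understand the API reference, those are assigned by the service and should be left out when creating a new policy. Below is a minimal Python sketch that strips them and prints a JSON body; `to_create_request` and `SERVICE_SET_POLICY_FIELDS` are hypothetical names, the policy dictionary is abbreviated from the YAML above, and the list of fields to drop is my assumption, so check it against the current documentation.

```python
import json

# Assumption: these top-level fields are set by the service, not by the caller.
SERVICE_SET_POLICY_FIELDS = ("name", "creationRecord", "mutationRecord")


def to_create_request(policy):
    """Return a copy of an exported policy without service-assigned fields."""
    cleaned = {
        key: value
        for key, value in policy.items()
        if key not in SERVICE_SET_POLICY_FIELDS
    }
    # Condition names are also assigned by the service, so drop them too.
    cleaned["conditions"] = [
        {key: value for key, value in condition.items() if key != "name"}
        for condition in policy.get("conditions", [])
    ]
    return cleaned


# Abbreviated copy of the exported policy shown above.
exported = {
    "displayName": "alerting-policy-when-allocatable-memory-is-above-85",
    "enabled": True,
    "combiner": "OR",
    "name": "projects/XX-YY-ZZ/alertPolicies/",
    "creationRecord": {"mutatedBy": "XXX@YYY.com"},
    "mutationRecord": {"mutatedBy": "XXX@YYY.com"},
    "conditions": [
        {
            "name": "projects/XX-YY-ZZ/alertPolicies/AAA/conditions/BBB",
            "displayName": "Memory allocatable utilization for GKE-CLUSTER-NAME",
            "conditionThreshold": {
                "filter": (
                    'metric.type="kubernetes.io/node/memory/allocatable_utilization" '
                    'resource.type="k8s_node" '
                    'resource.label."cluster_name"="GKE-CLUSTER-NAME"'
                ),
                "comparison": "COMPARISON_GT",
                "thresholdValue": 0.85,
                "duration": "60s",
                "trigger": {"count": 1},
            },
        }
    ],
}

create_request = to_create_request(exported)
print(json.dumps(create_request, indent=2))
```

You could save the printed JSON as policy.json and pass it to the same `gcloud alpha monitoring policies create --policy-from-file=policy.json` command used in the question.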

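Example of GCP Monitoring web access:

GCP Monitoring metrics web page

Please let me know if you have any questions about this.

Edit:

To create an alerting policy that shows relevant data, you need to take several factors into account, such as:

  • Workload types
  • The number of nodes and node pools
  • Node affinity (for example, spawning certain types of workloads on GPU nodes)
  • etc.

For a more advanced alerting policy that takes the allocatable memory of each node pool into account, you can do something like this: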

combiner: OR
conditions:
- conditionThreshold:
    aggregations:
    - alignmentPeriod: 60s
      crossSeriesReducer: REDUCE_SUM
      groupByFields:
      - metadata.user_labels."cloud.google.com/gke-nodepool"
      perSeriesAligner: ALIGN_MEAN
    comparison: COMPARISON_GT
    duration: 60s
    filter: metric.type="kubernetes.io/node/memory/allocatable_utilization" resource.type="k8s_node"
      resource.label."cluster_name"="CLUSTER_NAME"
    thresholdValue: 0.85
    trigger:
      count: 1
  displayName: Memory allocatable utilization (filtered) (grouped) [SUM]
creationRecord:
  mutateTime: '2020-03-31T18:03:20.325259198Z'
  mutatedBy: XXX@YYY.ZZZ
displayName: allocatable-memory-per-node-pool-above-85
enabled: true
mutationRecord:
  mutateTime: '2020-03-31T18:18:57.169590414Z'
  mutatedBy: XXX@YYY.ZZZ
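One thing worth checking before relying on the 0.85 threshold: allocatable_utilization is a per-node fraction, and REDUCE_SUM adds those fractions together within each group. The sketch below is plain arithmetic with made-up node values, not output from Cloud Monitoring, and suggesting REDUCE_MEAN as an alternative reducer is my assumption about what may be closer to the intent, so verify it against your own charts.

```python
def reduce_sum(values):
    """Sum of the aligned per-node values, as REDUCE_SUM would combine them."""
    return sum(values)


def reduce_mean(values):
    """Average of the aligned per-node values, as REDUCE_MEAN would combine them."""
    return sum(values) / len(values)


# Hypothetical node pool: three nodes, each using 40% of its allocatable memory.
node_utilization = [0.40, 0.40, 0.40]
threshold = 0.85

summed = reduce_sum(node_utilization)
averaged = reduce_mean(node_utilization)

print(f"REDUCE_SUM  -> {summed:.2f}, fires: {summed > threshold}")
print(f"REDUCE_MEAN -> {averaged:.2f}, fires: {averaged > threshold}")
```

With three nodes at 40% each, the summed value is already above 0.85 while the mean is not, so a pool with several nodes may trigger a summed condition even when no node is close to its limit.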

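Please note that there is a bug (see Groups.google.com: Google Stackdriver discussion): the only way to create the alerting policy above is by using the command line.

Answered 2020-03-31T08:57:21.037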