monitoring - Naming statsd metrics for short lived streams

Question

I am trying to model statistics to submit to statsd/graphite. However, what I am monitoring is "session"-centric. For example, I have a game that is played in real time. There are multiple instances of a game active on the servers. Each game has multiple (and a variable number of) participants. Each instance of a game has a unique ID, as does each player. I want to track (and graph) each player's stats, then roll the metric up for the whole instance, and then for all the instances of a game. For example, there may be two instances of a game active at a given time. Let's say each has two players in the game:

GameTitle.RealTime.VoiceErrors.game_instance_a.player_id_1 10
GameTitle.RealTime.VoiceErrors.game_instance_a.player_id_2 20
GameTitle.RealTime.VoiceErrors.game_instance_b.player_id_3 50
GameTitle.RealTime.VoiceErrors.game_instance_b.player_id_4 70

where game_instances and player_ids are 128-bit numbers.

I want to be able to see that the total of all voice errors for game_instance_a is 30, while the total of all voice errors across the system is 150.

Given this, I have three questions:

What guidance would you have on naming the metrics?
Is it kosher to have metrics that have "dynamic" identifiers as part of the name?
What are the scale limits on this? If I had 100K game instances with, say, as many as 1000 players in a game, is this going to kill statsd/graphite?

Thanks!

score 1 · Accepted Answer

What guidance would you have on naming the metrics?

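Graphite recommends that "Volatile path components should be kept as deep into the hierarchy as possible". In practice this means that if you can push the frequently unique parts of a metric to the end of the "bucket" without affecting your grouping queries, you should try to do so.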

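Here is a good article on using Graphite that includes naming suggestions. Here is another from Jason Dixon (a great source of Graphite material in general) with additional information.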
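To make the naming advice concrete, here is a minimal sketch of why keeping the volatile IDs deepest helps: every rollup the question asks for becomes a sum over a shared prefix. The sample values are the four from the question; the `rollup` helper is hypothetical and only imitates in plain Python what a wildcard target such as `sumSeries(GameTitle.RealTime.VoiceErrors.game_instance_a.*)` would do in Graphite.

```python
from collections import defaultdict

# Sample values copied from the question; the IDs are placeholders.
SAMPLES = {
    "GameTitle.RealTime.VoiceErrors.game_instance_a.player_id_1": 10,
    "GameTitle.RealTime.VoiceErrors.game_instance_a.player_id_2": 20,
    "GameTitle.RealTime.VoiceErrors.game_instance_b.player_id_3": 50,
    "GameTitle.RealTime.VoiceErrors.game_instance_b.player_id_4": 70,
}


def rollup(samples, depth):
    """Sum values grouped by the first `depth` dot-separated components."""
    totals = defaultdict(int)
    for name, value in samples.items():
        prefix = ".".join(name.split(".")[:depth])
        totals[prefix] += value
    return dict(totals)


per_instance = rollup(SAMPLES, 4)  # stable prefix plus the instance ID
system_wide = rollup(SAMPLES, 3)   # stable prefix only
print(per_instance)
print(system_wide)
```

With the stable components first, the per-instance total for game_instance_a is 30 and the system-wide total is 150, matching the numbers in the question.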

Is it kosher to have metrics that have "dynamic" identifiers as part of the name?

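I usually try to avoid identifiers in metric names unless there is a very small number of them (<100). Because Graphite stores one .wsp file per metric name, you will have a hard time resizing or adjusting storage settings if you later decide to change your configuration. In addition, the Graphite UI shows a "folder" for each metric name, so you can easily make the UI unusable.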

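In your case I would probably graph things like the total number of game instances, the total number of players, and the number of errors (by type). I might also try to track players per instance (in general) and perhaps errors per instance (again without knowing the actual instance, e.g. GameTitle.RealTime.PerInstance.VoiceErrors), if I had that capability (i.e. if my application stored per-instance state).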
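A rough sketch of what such ID-free metrics could look like on the wire follows. It uses only the standard StatsD line format (`<name>:<value>|<type>`, with `c` for counters, `g` for gauges and `ms` for timers). Apart from GameTitle.RealTime.PerInstance.VoiceErrors, the metric names are made up for illustration, and the dictionaries stand in for whatever per-instance state the application keeps. Sending the per-instance count as a timer sample is an assumption: it relies on StatsD summarizing timer values (mean, upper, and so on) even when they are not durations.

```python
import socket


def statsd_line(name, value, metric_type):
    """Format one StatsD datagram: <name>:<value>|<type>."""
    return f"{name}:{value}|{metric_type}"


def aggregate_lines(errors_by_instance, players_by_instance):
    """Build ID-free metrics from per-instance state held by the application.

    Assumes the error counts are increments since the previous send,
    because StatsD counters add up whatever they receive.
    """
    lines = [
        # Hypothetical names for the totals suggested in the answer.
        statsd_line("GameTitle.RealTime.ActiveInstances", len(players_by_instance), "g"),
        statsd_line("GameTitle.RealTime.ActivePlayers", sum(players_by_instance.values()), "g"),
        statsd_line("GameTitle.RealTime.VoiceErrors", sum(errors_by_instance.values()), "c"),
    ]
    # One sample per instance under a single name, so no instance ID
    # ever appears in a metric path.
    for count in errors_by_instance.values():
        lines.append(statsd_line("GameTitle.RealTime.PerInstance.VoiceErrors", count, "ms"))
    return lines


def send(lines, host="127.0.0.1", port=8125):
    """Send each line as a UDP datagram (8125 is the usual StatsD port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for line in lines:
            sock.sendto(line.encode("ascii"), (host, port))


lines = aggregate_lines(
    errors_by_instance={"game_instance_a": 30, "game_instance_b": 120},
    players_by_instance={"game_instance_a": 2, "game_instance_b": 2},
)
print("\n".join(lines))
```

Calling `send(lines)` would push the datagrams to a local StatsD daemon. The number of metric names stays fixed no matter how many instances or players exist.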

Logstash, Elasticsearch, Kibana

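I would recommend logging this error information together with the instance and player IDs, and using Logstash to ship your logs to Elasticsearch and Kibana. I would then watch Graphite for real-time error and health anomaly detection, and use Kibana (with Elasticsearch underneath) to dig deeper.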
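One way to produce such logs is to emit one JSON object per line, which log shippers can generally parse without custom patterns. The sketch below uses only Python's standard `logging` and `json` modules. The logger name and the field names (`event`, `game_instance`, `player_id`) are illustrative assumptions, not a schema that Logstash or Kibana requires.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "event": record.getMessage(),
            # Hypothetical field names. The high-cardinality IDs live
            # here, in the log entry, not in a metric name.
            "game_instance": getattr(record, "game_instance", None),
            "player_id": getattr(record, "player_id", None),
        }
        return json.dumps(payload)


def build_logger(stream=sys.stdout):
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("game.voice")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


log = build_logger()
log.error(
    "voice_error",
    extra={"game_instance": "game_instance_a", "player_id": "player_id_1"},
)
```

The metric side then only needs a plain error counter, while the per-player detail stays searchable in the logs.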

What are the scale limits on this? If I had 100K game instances with, say, as many as 1000 players in a game, is this going to kill statsd/graphite?

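Statsd should have no problem with this, since it simply acts as a (mostly dumb) aggregator. It does keep some state internally, but I would not expect that to be an issue.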

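I don't think Graphite's internal Whisper storage will have a problem in itself, since it just uses files and folders. However, as I mentioned above, the Graphite web UI would become unusable, and I think you would also risk other manageability issues.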
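To put a rough number on the manageability concern, here is a back-of-the-envelope estimate of the file count and disk footprint if every player in the question's scenario had its own metric. It relies on the Whisper file layout (a 16-byte metadata header, 12 bytes of archive info per archive, 12 bytes per stored point). The retention of 10-second points kept for 6 hours is purely an assumption for illustration; a real storage schema would likely hold more.

```python
# Whisper on-disk layout: 16-byte metadata header, 12 bytes of
# archive info per archive, and 12 bytes per stored data point.
METADATA_BYTES = 16
ARCHIVE_INFO_BYTES = 12
POINT_BYTES = 12


def whisper_file_bytes(points_per_archive):
    """Size of one .wsp file given the point count of each archive."""
    return (
        METADATA_BYTES
        + ARCHIVE_INFO_BYTES * len(points_per_archive)
        + POINT_BYTES * sum(points_per_archive)
    )


# Figures from the question.
instances = 100_000
players_per_instance = 1_000
series = instances * players_per_instance  # one .wsp file per player metric

# Assumed retention: one archive of 10-second points kept for 6 hours.
points = [6 * 60 * 60 // 10]  # 2160 points
per_file = whisper_file_bytes(points)
total_bytes = series * per_file

print(f"{series:,} files, {per_file:,} bytes each, {total_bytes / 1e12:.2f} TB total")
```

Under these assumptions that is 100 million files of about 25 KB each, roughly 2.6 TB, and that is for a single metric type. Because the files remain on disk after a session ends, the count keeps growing unless old files are cleaned up.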

Summary

Keep the volatile (dynamic) metric buckets at the end of the name, and avoid having more than a few hundred of them.
