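So I am trying to find machines that throw more exceptions than the rest of their group, where groups are defined by environment and function. The intuition is that load and task type should be very similar across a group, so if one machine throws noticeably more exceptions, it is probably in some bad state and should be serviced.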
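This works fairly well for large groups of machines, but there is a problem with smaller groups: if there are only a few machines and just one of them throws a lot of exceptions, it may not be detected. The reason is that this data point is itself part of the group's mean and stddev calculation, so both the mean and the stddev are skewed toward the outlier.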
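The solution is either to somehow subtract that data point from the stddev and mean computed for the whole group, or to compute the stddev and mean per machine/environment/function combination (excluding the machine in question from the stddev/mean calculation) rather than only per environment/function group.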
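The first option can be sketched without recomputing anything per machine. If the group's machine count n, the sum S, and the sum of squares Q of the per-machine counts are known, then for a machine with count x the other machines have mean (S - x) / (n - 1) and sample variance (Q - x^2 - (n - 1) * mean^2) / (n - 2). The query below only illustrates that identity and is not a tested change to the function; the `datatable` rows are made-up counts, and the names `PerMachine`, `GroupTotals`, `cnt`, `othersMean`, and `othersStdDev` are hypothetical.

```kusto
// Hypothetical per-machine exception counts for one small group (made-up numbers)
let PerMachine = datatable(environmentName:string, machineFunction:string, machineName:string, cnt:long)
[
    "prod", "web", "m1", 10,
    "prod", "web", "m2", 10,
    "prod", "web", "m3", 12,
    "prod", "web", "m4", 100
];
// Per-group machine count, sum, and sum of squares of the per-machine counts
let GroupTotals = PerMachine
    | summarize n = count(), s = sum(cnt), q = sum(cnt * cnt) by environmentName, machineFunction;
PerMachine
| join kind=inner GroupTotals on environmentName, machineFunction
// A sample stddev of the *other* machines needs at least two of them
| where n > 2
// Mean of the group with this machine's own count removed
| extend othersMean = todouble(s - cnt) / (n - 1)
// Sample variance of the other machines, derived from the running sums
| extend othersVar = (todouble(q - cnt * cnt) - (n - 1) * othersMean * othersMean) / (n - 2)
// Clamp at zero to guard against tiny negative values from rounding
| extend othersStdDev = sqrt(max_of(othersVar, 0.0))
| project machineName, cnt, othersMean, othersStdDev
```

With these made-up counts, the whole-group statistics (mean 33, stddev roughly 44.7) put the 3-sigma cutoff near 167, so the machine at 100 is missed, whereas its leave-one-out statistics (mean roughly 10.7, stddev roughly 1.2) put the cutoff near 14. Groups with fewer than three machines are dropped here, which is an assumption about how such groups should be handled.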
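Here is my current code, which computes the statistics per environment/function group. Is there an elegant way to extend it to work per machine/environment/function?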
// Find sick machines
let SickMachinesAt = (AtTime:datetime, TimeWindow:timespan = 1h, Sigmas:double = 3.0, MinimumExceptionsToTrigger:int = 10) {
    // These are the exceptions we are looking at (time window constrained)
    let Exceptions = exception
        | where EventInfo_Time between((AtTime - TimeWindow) .. AtTime);
    // Calculate mean and stddev of the per-machine counts for each environmentName + machineFunction group
    let MeanAndStdDev = Exceptions
        | summarize count() by environmentName, machineFunction, machineName
        | summarize avg(count_), stdev(count_) by environmentName, machineFunction
        | order by environmentName, machineFunction;
    // Attach the group statistics to each machine's own exception count
    let MachinesWithMeanAndStdDev = Exceptions
        | summarize count() by environmentName, machineFunction, machineName
        | join kind=fullouter MeanAndStdDev on environmentName, machineFunction;
    // Keep machines above both the sigma cutoff and the minimum exception count
    let SickMachines = MachinesWithMeanAndStdDev
        | project machineName,
                  machineFunction,
                  environmentName,
                  totalExceptionCount = count_,
                  cutoff = avg_count_ + Sigmas * stdev_count_,
                  signalStrength = ((count_ - avg_count_) / stdev_count_)
        | where totalExceptionCount > cutoff and totalExceptionCount > MinimumExceptionsToTrigger
        | order by signalStrength desc;
    SickMachines
}