python - Analysing trends and detecting abnormal behaviour

Question

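I am building a system that logs data from sensors (just series of numbers).

I would like to be able to put the system into a "learn" mode for a few days so that it can see what its "normal" operating values are. Once it leaves that mode, any deviation from this behaviour beyond a certain point should be flagged. All the data is stored in a MySQL database.

Any suggestions on how to do this are welcome, as are pointers to further reading on the topic.

I would prefer to use Python for this task.

The data consists of temperature and humidity values taken every 5 minutes in a temperature-controlled area that is accessed and used during the day. This means the readings fluctuate while the area is in use, and there are some temperature changes. Anything that differs from this pattern, such as a cooling or heating system failure, needs to be detected.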

score 1 · Accepted Answer

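In essence, what you should be looking at is density estimation: the task of building a model of how some variables behave, so that you can look for deviations from it.

Here is some very simple example code. I have assumed that temperature and humidity have independent normal distributions on their untransformed scales: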

import numpy as np
from scipy.stats import norm  # matplotlib.mlab.normpdf was removed from matplotlib


class TempAndHumidityModel(object):
    def __init__(self):
        self.tempMu = 0
        self.tempSigma = 1
        self.humidityMu = 0
        self.humiditySigma = 1
        self.threshold = 0

    def setParams(self, tempMeasurements, humidityMeasurements, quantile):
        # Fit an independent normal distribution to each variable
        self.tempMu = np.mean(tempMeasurements)
        self.tempSigma = np.std(tempMeasurements)
        self.humidityMu = np.mean(humidityMeasurements)
        self.humiditySigma = np.std(humidityMeasurements)

        if not 0 < quantile <= 1:
            raise ValueError("Quantile for threshold must be between 0 and 1")

        self._thresholdDensity(quantile, tempMeasurements, humidityMeasurements)

    def _thresholdDensity(self, quantile, tempMeasurements, humidityMeasurements):
        # norm.pdf is vectorised, so it can be applied to whole arrays at once
        tempDensities = norm.pdf(tempMeasurements, self.tempMu, self.tempSigma)
        humidityDensities = norm.pdf(humidityMeasurements, self.humidityMu, self.humiditySigma)

        # Joint density under the independence assumption, highest first
        densities = sorted(tempDensities * humidityDensities, reverse=True)
        # Here comes the massive oversimplification: just choose the
        # density value at the quantile*length position, and use this as the threshold.
        # The index is clamped so that quantile == 1 does not run past the end.
        index = min(int(np.round(quantile * len(densities))), len(densities) - 1)
        self.threshold = densities[index]

    def probOfObservation(self, temp, humidity):
        # Strictly speaking this is a joint density, not a probability
        return (norm.pdf(temp, self.tempMu, self.tempSigma) *
                norm.pdf(humidity, self.humidityMu, self.humiditySigma))

    def isNormalMeasurement(self, temp, humidity):
        return self.probOfObservation(temp, humidity) > self.threshold


if __name__ == '__main__':
    # Create some simulated data
    temps = np.random.randn(100)*10 + 50
    humidities = np.random.randn(100)*2 + 10

    thm = TempAndHumidityModel()
    # going to hard code in the 95% threshold
    thm.setParams(temps, humidities, 0.95)

    # Create some new data from same dist and see how many false positives
    newTemps = np.random.randn(100)*10 + 50
    newHumidities = np.random.randn(100)*2 + 10

    numFalseAlarms = sum(not thm.isNormalMeasurement(t, h)
                         for t, h in zip(newTemps, newHumidities))
    print('{} false alarms!'.format(numFalseAlarms))

    # Now create some abnormal data: mean temp drops to 20
    lowTemps = np.random.randn(100)*10 + 20
    normalHumidities = np.random.randn(100)*2 + 10

    numDetections = sum(not thm.isNormalMeasurement(t, h)
                        for t, h in zip(lowTemps, normalHumidities))
    print('{} abnormal measurements flagged'.format(numDetections))

Example output (the exact counts vary from run to run because the data is random):

>> 3 false alarms!
>> 77 abnormal measurements flagged

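Now, I have no idea whether the normality assumption is appropriate for your data (you may want to transform the data onto a different scale so that it is); assuming independence between temperature and humidity is probably wildly inaccurate; and the trick I used to find the density value corresponding to the requested quantile of the distribution should be replaced with something that uses the inverse CDF of the distribution. However, this should give you a flavour of what to do.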
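As a rough sketch of the inverse-CDF idea, and assuming the readings are jointly Gaussian: the squared Mahalanobis distance of a two-dimensional normal observation from its mean follows a chi-square distribution with 2 degrees of freedom, so the cut-off can be read directly from `scipy.stats.chi2.ppf`. Fitting a full covariance matrix also drops the independence assumption. The function names and the simulated data below are illustrative, not part of the original answer.

```python
import numpy as np
from scipy.stats import chi2


def fit_gaussian(measurements):
    """Estimate the mean vector and covariance matrix from an (n, 2) array."""
    mean = measurements.mean(axis=0)
    cov = np.cov(measurements, rowvar=False)
    return mean, cov


def squared_mahalanobis(points, mean, cov):
    """Squared Mahalanobis distance of each row of `points` from `mean`."""
    diff = np.atleast_2d(points) - mean
    return np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)


def is_normal(points, mean, cov, quantile=0.95):
    """True where a point lies inside the `quantile` probability region."""
    # For a bivariate normal the squared distance is chi-square with 2 degrees
    # of freedom, so the cut-off comes straight from the inverse CDF (ppf).
    cutoff = chi2.ppf(quantile, df=2)
    return squared_mahalanobis(points, mean, cov) <= cutoff


# Simulated data on the same scale as the example above (an assumption)
rng = np.random.default_rng(0)
training = np.column_stack([rng.normal(50, 10, 1000), rng.normal(10, 2, 1000)])
mean, cov = fit_gaussian(training)

fresh = np.column_stack([rng.normal(50, 10, 1000), rng.normal(10, 2, 1000)])
cold = np.column_stack([rng.normal(20, 10, 1000), rng.normal(10, 2, 1000)])
print('false alarm rate:', 1 - is_normal(fresh, mean, cov).mean())
print('detection rate:', 1 - is_normal(cold, mean, cov).mean())
```

Because the Gaussian density decreases monotonically with the Mahalanobis distance, thresholding the distance is equivalent to thresholding the density, but the cut-off no longer depends on sorting the training sample.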

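Note also that there are many good non-parametric density estimators: kernel density estimators spring immediately to mind. These may be more appropriate if your data does not look like any standard distribution.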
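A minimal sketch of the kernel density route, using `scipy.stats.gaussian_kde`. The two-cluster training data (a cooler unoccupied regime and a warmer occupied one) and the 5% cut-off are assumptions made up for illustration; the threshold is simply the density below which the lowest 5% of training points fall.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
# Simulated "learn mode" data with two regimes, so neither variable is normal
temps = np.concatenate([rng.normal(18, 1.0, 500), rng.normal(24, 1.5, 500)])
humidities = np.concatenate([rng.normal(45, 3.0, 500), rng.normal(35, 3.0, 500)])

# gaussian_kde expects an array of shape (n_dimensions, n_samples)
training = np.vstack([temps, humidities])
kde = gaussian_kde(training)

# Empirical threshold: the 5% quantile of the density at the training points
training_density = kde.evaluate(training)
threshold = np.quantile(training_density, 0.05)


def is_normal(temp, humidity):
    """True if the estimated density at this reading is above the threshold."""
    return kde.evaluate([[temp], [humidity]])[0] >= threshold


print(is_normal(24.0, 35.0))  # a reading in the middle of the warm regime
print(is_normal(5.0, 35.0))   # a reading far colder than anything seen
```

Unlike the parametric model, this makes no assumption about the shape of the distribution, at the cost of evaluating every training point for each new reading.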

score 0 · Accepted Answer

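It looks like you are trying to perform anomaly detection, but your description of the data is vague. In general, you should first try to define and constrain what "normal" means for your data.

Is there a different "normal" for each sensor?
Does a sensor measurement depend in some way on the measurement before it?
Does "normal" change over the course of a day?
Can the sensor's "normal" measurements be characterised by a statistical model (for example, is the data Gaussian or log-normal)?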
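If "normal" does change over the course of a day, one simple option is a separate baseline per hour of day. The sketch below assumes readings every 5 minutes, a hypothetical occupied period of 08:00 to 18:00, and a cut-off of 4 standard deviations; all of these are illustrative choices.

```python
import numpy as np


def fit_hourly_baseline(hours, values):
    """Return per-hour mean and standard deviation arrays of length 24."""
    hours = np.asarray(hours)
    values = np.asarray(values, dtype=float)
    means = np.full(24, np.nan)
    stds = np.full(24, np.nan)
    for hour in range(24):
        in_hour = values[hours == hour]
        if in_hour.size > 1:
            means[hour] = in_hour.mean()
            stds[hour] = in_hour.std(ddof=1)
    return means, stds


def is_anomalous(hour, value, means, stds, max_z=4.0):
    """Flag a reading more than `max_z` standard deviations from that hour's mean."""
    return abs(value - means[hour]) > max_z * stds[hour]


# Simulated training data: three days of 5-minute readings, warmer when occupied
rng = np.random.default_rng(2)
minutes = np.arange(0, 3 * 24 * 60, 5)
hours = (minutes // 60) % 24
occupied = (hours >= 8) & (hours < 18)
temps = np.where(occupied, 22.0, 18.0) + rng.normal(0, 0.5, minutes.size)

means, stds = fit_hourly_baseline(hours, temps)
print(is_anomalous(3, 22.0, means, stds))   # day-time temperature at 3 am
print(is_anomalous(12, 22.0, means, stds))  # the same temperature at noon
```

The same reading is judged differently depending on when it occurs, which a single global threshold cannot do.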

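Once you have answered these kinds of questions, you can train a classifier or anomaly detector on a batch of data from your database and use the result to evaluate future log output. If machine learning algorithms are applicable to your data, you might consider using scikit-learn. For statistical models, you can use SciPy's stats subpackage. And of course, for any kind of numerical data manipulation in Python, NumPy is your friend.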
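As one possible illustration of the scikit-learn route, here is a sketch using `IsolationForest`. The choice of detector, the 5% `contamination` value, and the simulated training batch are assumptions; in practice the training array would hold rows read from the database.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(3)
# Simulated training batch of (temperature, humidity) pairs
training = np.column_stack([rng.normal(50, 10, 1000), rng.normal(10, 2, 1000)])

# contamination is the fraction of training points treated as anomalous
# (5% here is an assumed value, not a recommendation)
detector = IsolationForest(n_estimators=100, contamination=0.05, random_state=0)
detector.fit(training)

new_readings = np.array([
    [50.0, 10.0],  # a typical reading
    [0.0, 20.0],   # far outside the range seen during training
])
labels = detector.predict(new_readings)  # +1 means normal, -1 means anomaly
print(labels)
```

Other detectors in scikit-learn follow the same `fit` and `predict` pattern, so swapping in a different one requires changing only the constructor line.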
