
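I am trying to detect drift in a given list of one-dimensional streaming data. If there is no trend in the data, I expect 0 <= confidence score <= 0.20, but if drift is detected, I expect 0.90 <= confidence score <= 1.

I have attached the Python 3.x code snippet I am using, along with my manual calculation (the picture at the end).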

import numpy as np
from scipy import stats


class UnivariateDriftAnalysis:
    ''' This technique looks for a trend in recent data using linear
    regression as a statistical test that the trend is non-zero
    Currently, this uses a fixed window length, but future versions might
    incorporate a search over a range of window lengths
    '''

    def __init__(self, n_window, p=0.01):
        '''
        n_window - (int) length of data history to look for a trend
        p - (float) desired confidence or false positive rate.
            p=.05 means that alarms will be raised when there is <5% chance
            that there is no trend
        '''
        self.n_window = n_window
        self.p = p

    def drift_detected(self, data) -> list:
        ''' Returns a list, x, of confidence scores (1 - p-value) that the
        slope of the data is not zero.
        x[i] corresponds to the slope of data[i-n_window:i]
        The first n_window values of x are np.nan
        '''
        n = len(data)
        y = []
        x0 = np.arange(n)
        result: list = [np.nan] * self.n_window
        i = 0
        for d in data:
            y.append(d)
            if len(y) < self.n_window:
                # not enough samples yet to fill one window
                continue
            y = y[-self.n_window:]
            x = x0[i:i + self.n_window]
            p_value = stats.linregress(x, y).pvalue
            # pvalue tests the null hypothesis that the slope is zero
            result.append(1-p_value)
            i += 1
        return result

    def update(self, data) -> None:
        ''' this function is designed to handle live stream of data'''
        scores = self.drift_detected(data)
        # scores hold 1 - p_value, so an alarm means p_value < self.p
        alarms = [s > 1 - self.p for s in scores]
        # some other stuff

# Test
np.random.seed(100)
n_window = 10
lr = UnivariateDriftAnalysis(n_window=n_window, p=.01)
data = np.concatenate([np.random.randint(24, 47, 1500), np.random.randint(1000, 4000, 2000), np.random.randint(1, 5, 500)])
score = lr.drift_detected(data)
print(score[n_window:])  # lowest: 0 highest: 0.9953301824956942

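Questions:

  • What am I missing? Why is the confidence score as high as 0.9953!?
  • My end goal is to define a p-value for a given data array so that I can compute a confidence that drift exists.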
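
One way to probe the first question is sketched below. It is an illustration under stated assumptions, not a confirmed diagnosis. When the data has no trend, the p-value for the slope is spread roughly uniformly over [0, 1], so 1 - p_value also covers the whole range instead of staying below 0.20. The helper `window_scores`, the Gaussian noise and the sample size are hypothetical choices made for this sketch.

```python
import numpy as np
from scipy import stats


def window_scores(data, n_window):
    """Return 1 - p_value of the fitted slope for every full sliding window."""
    x = np.arange(n_window)
    return np.array([
        1 - stats.linregress(x, data[i:i + n_window]).pvalue
        for i in range(len(data) - n_window + 1)
    ])


# Hypothetical input: stationary Gaussian noise, trendless by construction.
rng = np.random.default_rng(100)
noise = rng.normal(size=3000)

scores = window_scores(noise, n_window=10)
frac_high = np.mean(scores > 0.90)   # share of windows that look like "drift"
max_score = scores.max()

print("fraction of windows with score > 0.90:", frac_high)
print("highest score:", max_score)
```

If this reading is right, about one window in ten should score above 0.90 even on pure noise, so a maximum near 0.995 over several thousand windows would not be surprising.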

Screenshot of the manual p-value calculation
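
The numbers in the screenshot are not reproduced here. As a rough cross-check of a hand calculation, the sketch below computes the two-sided p-value of the slope from the t statistic and compares it with `stats.linregress`. The helper `slope_p_value` and the ten sample values are hypothetical and chosen only for illustration.

```python
import numpy as np
from scipy import stats


def slope_p_value(x, y):
    """Two-sided p-value for H0: slope == 0, computed step by step."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    sxx = np.sum((x - x.mean()) ** 2)
    slope = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    intercept = y.mean() - slope * x.mean()
    residuals = y - (intercept + slope * x)
    # Standard error of the slope, using n - 2 degrees of freedom.
    std_err = np.sqrt(np.sum(residuals ** 2) / (n - 2) / sxx)
    t_stat = slope / std_err
    # Two-sided tail probability of Student's t distribution.
    return 2 * stats.t.sf(abs(t_stat), df=n - 2)


# Hypothetical window of 10 readings (not taken from the screenshot).
x = np.arange(10)
y = np.array([30, 41, 25, 44, 38, 29, 46, 33, 27, 40])

manual = slope_p_value(x, y)
library = stats.linregress(x, y).pvalue
print("manual p-value: ", manual)
print("library p-value:", library)
```

The two values should agree up to floating-point error, which makes it easier to tell whether a discrepancy comes from the hand calculation or from how the score is interpreted.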
