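I am trying to detect drift in a given one-dimensional list of streamed data. If there is no trend in the data, I expect 0 <= confidence score <= 0.20, but if drift is detected, I expect 0.90 <= confidence score <= 1.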
I have attached the Python 3.x code snippet I am using, along with my hand calculation (the image at the end).
import numpy as np
# from univariate import UnivariateAnalysis  # project-local module, unused in this snippet
from scipy import stats


class UnivariateDriftAnalysis:
    ''' This technique looks for a trend in recent data using linear
    regression as a statistical test that the trend is non-zero.
    Currently, this uses a fixed window length, but future versions might
    incorporate a search over a range of window lengths.
    '''

    def __init__(self, n_window, p=0.01):
        '''
        n_window - (int) length of data history to look for a trend
        p - (float) desired confidence or false positive rate.
            p=.05 means that alarms will be raised when there is <5% chance
            that there is no trend
        '''
        self.n_window = n_window
        self.p = p

    def drift_detected(self, data) -> list:
        ''' Returns a list, x, of probabilities that the slope of the data is
        not zero, i.e., the confidence that there is a slope.
        x[i] corresponds to the slope of data[i-n_window:i]
        The first n_window values of x are np.nan
        '''
        n = len(data)
        y = []
        x0 = np.arange(n)
        result: list = [np.nan] * self.n_window
        i = 0
        for d in data:
            y.append(d)
            if len(y) < self.n_window:
                # not enough history yet to fill a window
                continue
            y = y[-self.n_window:]
            x = x0[i:i + self.n_window]
            # linregress also returns slope, intercept, rvalue and stderr
            p_value = stats.linregress(x, y).pvalue
            result.append(1 - p_value)
            i += 1
        return result

    def update(self, data) -> None:
        ''' this function is designed to handle live stream of data'''
        scores = self.drift_detected(data)
        # scores hold 1 - p_value, so an alarm means the score exceeds 1 - p
        alarms = [s > 1 - self.p for s in scores]
        # some other stuff


# Test
np.random.seed(100)
n_window = 10
lr = UnivariateDriftAnalysis(n_window=n_window, p=.01)
data = np.concatenate([np.random.randint(24, 47, 1500), np.random.randint(1000, 4000, 2000), np.random.randint(1, 5, 500)])
score = lr.drift_detected(data)
print(score[n_window:])  # lowest: 0 highest: 0.9953301824956942
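
One way to probe the output above is to score data that has no trend by construction and look at how `1 - p_value` is distributed. The sketch below is an illustration, not part of the class: `window_scores`, the seed, and the non-overlapping windows are assumptions made here to keep the windows independent. If the p-value is roughly uniform on [0, 1] when there is no slope, then `1 - p_value` is too, so values above 0.99 should turn up in about 1% of windows.

```python
import numpy as np
from scipy import stats


def window_scores(data, n_window):
    """Return 1 - p_value of the slope test for each non-overlapping window.

    Hypothetical helper for illustration; not part of the class above.
    """
    x = np.arange(n_window)
    n_full = len(data) // n_window
    windows = np.asarray(data[: n_full * n_window]).reshape(n_full, n_window)
    return np.array([1 - stats.linregress(x, w).pvalue for w in windows])


rng = np.random.default_rng(0)              # seed chosen arbitrarily
flat = rng.integers(24, 47, size=20_000)    # stationary data, no trend by construction
scores = window_scores(flat, n_window=10)

print("median score:", np.median(scores))
print("share above 0.99:", np.mean(scores > 0.99))
print("highest score:", scores.max())
```

With thousands of windows scored, the highest value alone says little about whether drift is present; the share of windows above the threshold is the more informative number.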
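Questions:
- What am I missing? Why is the confidence score as high as 0.9953!?
- My ultimate goal is to define a p value for a given data array so that I can compute the confidence that drift is present.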
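
For checking a hand calculation of the p value, here is a minimal sketch that computes the two-sided slope test step by step and compares it with `stats.linregress`. The function name `slope_p_value` and the sample `window` values are made up for illustration; the formula is the standard t-test on the least-squares slope with n - 2 degrees of freedom, which is what `linregress` reports by default.

```python
import numpy as np
from scipy import stats


def slope_p_value(y):
    """Two-sided p-value for H0: slope == 0, computed step by step.

    Hypothetical helper for illustration; x is taken to be 0, 1, ..., n-1.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    x = np.arange(n, dtype=float)
    sxx = np.sum((x - x.mean()) ** 2)
    sxy = np.sum((x - x.mean()) * (y - y.mean()))
    slope = sxy / sxx
    intercept = y.mean() - slope * x.mean()
    residuals = y - (intercept + slope * x)
    # Standard error of the slope, using n - 2 degrees of freedom.
    std_err = np.sqrt(np.sum(residuals ** 2) / (n - 2) / sxx)
    t_stat = slope / std_err
    return 2 * stats.t.sf(abs(t_stat), df=n - 2)


window = [31, 44, 27, 38, 25, 46, 33, 29, 40, 36]  # made-up values for illustration
p_manual = slope_p_value(window)
p_scipy = stats.linregress(np.arange(len(window)), window).pvalue
print(p_manual, p_scipy)
```

The two printed values should agree up to floating-point rounding, which gives a quick way to confirm each step of a manual calculation.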