After clustering my data by similar behaviour, I am now trying to detect anomalies within each cluster. The data is a list of `pandas.DataFrame`s that looks like this:
In [0]: ip_series
Out [0]: [ rolling_mean
rt
2021-01-13 12:00:00 0.000000
2021-01-13 17:00:00 0.005034
2021-01-13 18:00:00 0.003356
2021-01-14 00:00:00 0.003523
2021-01-14 01:00:00 0.010067
... ...
2021-01-31 07:00:00 0.430872
2021-01-31 08:00:00 0.444104
2021-01-31 09:00:00 0.390856
2021-01-31 19:00:00 0.518255
2021-01-31 20:00:00 0.440268
[153 rows x 1 columns],
rolling_mean
rt
2021-01-13 12:00:00 0.003598
2021-01-13 17:00:00 0.003598
2021-01-13 18:00:00 0.000000
2021-01-14 00:00:00 0.003598
2021-01-14 01:00:00 0.003598
... ...
2021-01-31 07:00:00 0.773146
2021-01-31 08:00:00 0.773917
2021-01-31 09:00:00 0.676952
2021-01-31 19:00:00 0.599496
2021-01-31 20:00:00 0.528068
[153 rows x 1 columns],
...]
As you can see, each DataFrame consists of timestamps and values (the values are already normalized). As a preprocessing step, I reshape the data:
In [1]: train_data = ip_series.copy()
        for i in range(len(ip_series)):
            train_data[i] = train_data[i].values.reshape(len(train_data[i]))
In [2]: train_data[0]
Out [2]: array([0. , 0.00503356, 0.0033557 , 0.00352349, 0.01006711,
0.01979866, 0.05378715, 0.11764142, 0.14122723, 0.16423778,
0.1906999 , 0.2042186 , 0.3008629 , 0.34443912, 0.33494727,
0.3596836 , 0.36917546, 0.34341443, 0.40800575, 0.37260906,
0.33277405, 0.32063758, 0.26728188, 0.26442953, 0.24161074,
0.21221477, 0.17775647, 0.22924257, 0.22147651, 0.19932886,
0.18098434, 0.16328859, 0.15830537, 0.2010906 , 0.17401726,
0.17833174, 0.43127517, 0.3590604 , 0.36931927, 0.33394056,
0.32603068, 0.33510906, 0.31353468, 0.28540268, 0.34440716,
0.32628635, 0.33133389, 0.35725671, 0.32718121, 0.31233221,
0.31258389, 0.31963087, 0.30629195, 0.2886745 , 0.30488974,
0.29798658, 0.28062081, 0.33451582, 0.32387344, 0.29697987,
0.29043624, 0.26823266, 0.37561521, 0.53758389, 0.59261745,
0.63199105, 0.57516779, 0.58612975, 0.65486577, 0.74421141,
0.67181208, 0.49731544, 0.52167785, 0.33704698, 0.30241611,
0.28791946, 0.30040268, 0.2933557 , 0.3300183 , 0.36129754,
0.40067114, 0.36563758, 0.34996949, 0.35004794, 0.42511985,
0.38513902, 0.35134228, 0.31722595, 0.29255034, 0.19907718,
0.29345638, 0.29888143, 0.39986577, 0.52067114, 0.43456376,
0.43087248, 0.36362416, 0.32550336, 0.33854267, 0.32491611,
0.28948546, 0.23713647, 0.23214765, 0.23395973, 0.23818792,
0.25530201, 0.25328859, 0.24181208, 0.26687004, 0.23575351,
0.2319097 , 0.29888143, 0.61937919, 0.84161074, 0.88906999,
0.96409396, 1. , 0.86462128, 0.76208054, 0.77491611,
0.53833893, 0.48903803, 0.36711409, 0.3344519 , 0.31932886,
0.3147651 , 0.3442953 , 0.34272931, 0.30825503, 0.32295302,
0.4541387 , 0.53255034, 0.49651007, 0.55026846, 0.53496644,
0.51982916, 0.66241611, 0.86935123, 0.84020134, 0.7876144 ,
0.72365772, 0.69295302, 0.64383067, 0.49530201, 0.51159243,
0.52037828, 0.50756559, 0.35349952, 0.43087248, 0.44410355,
0.3908557 , 0.51825503, 0.44026846])
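(Since every DataFrame has a single column, the same flattening can also be written with `ravel`; the tiny frames below are just made-up stand-ins for my actual `ip_series`:)

```python
import pandas as pd

# two small single-column frames standing in for ip_series (made-up values)
ip_series = [
    pd.DataFrame({'rolling_mean': [0.0, 0.005, 0.003]}),
    pd.DataFrame({'rolling_mean': [0.004, 0.0, 0.004]}),
]

# flatten each (n, 1) frame to a 1-D array, equivalent to the reshape loop above
train_data = [df.values.ravel() for df in ip_series]
print(train_data[0].shape)  # (3,)
```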
The clustering is done with `TimeSeriesKMeans` on `train_data`:
from tslearn.clustering import TimeSeriesKMeans

km = TimeSeriesKMeans(n_clusters=72, metric='dtw')
labels = km.fit_predict(train_data)
This step is essential because the series behave very differently, and my goal is to build several models based on these clusters and use an isolation forest to detect anomalies in every time series (within each cluster). To that end, I create a list of DataFrames sorted by cluster.
df_test = pd.DataFrame(zip(train_data, labels))
df_test.columns = ['values', 'cluster']
# transform df_test into list of dataframes sorted per cluster
cluster_df_list = []
for i in set(labels):
    df_train_iforest = df_test.loc[df_test['cluster'] == i].reset_index(drop=True)
    cluster_df_list.append(df_train_iforest)

# training
from sklearn.ensemble import IsolationForest

for i in range(len(cluster_df_list)):
    for j in range(len(cluster_df_list[i]['values'])):
        train_data_iforest = (cluster_df_list[i]['values'][j]).reshape(-1, 1)
        model = IsolationForest()
        model.fit(train_data_iforest)
        cluster_df_list[i]['anomaly'] = pd.Series(model.predict(train_data_iforest))
        cluster_df_list[i]['anomaly'] = cluster_df_list[i]['anomaly'].map({1: 0, -1: 1})
    anomaly_cluster_df = cluster_df_list[i].loc[cluster_df_list[i]['anomaly'] == 1].reset_index(drop=True)
What I get are whole arrays that are flagged as outliers. What I actually want is a "classic" isolation forest, i.e. one that detects anomalous points within each array of a cluster. What am I doing wrong? Is my preprocessing incorrect, or do I have to feed the model differently?

TL;DR: How do I train a single model per cluster so that, instead of flagging whole arrays of a cluster as anomalous, it detects the anomalous points within each array?
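To illustrate the behaviour I am after, here is a minimal self-contained sketch on synthetic data (not my actual series) of a per-point isolation forest on a single array:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
series = rng.normal(0.3, 0.05, 150)  # synthetic stand-in for one array
series[40] = 1.5                     # inject one obvious outlier point

X = series.reshape(-1, 1)            # one sample per time step
model = IsolationForest(random_state=0).fit(X)
flags = (model.predict(X) == -1).astype(int)  # 1 marks an anomalous point
print(flags[40])  # the injected point is flagged: 1
```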