This question is an extension of this question, which focused on an LSTM rather than a CRF. Unfortunately, I do not have any experience with CRFs, which is why I'm asking these questions.
The problem:
I would like to predict a sequence of binary signals for multiple, non-independent groups. My dataset is fairly small (~1000 records per group), so I would like to try a CRF model here.
Available data:
I have a dataset with the following variables:
- timestamps
- group
- a binary signal representing activity
Using this dataset, I would like to forecast group_a_activity and group_b_activity, i.e. whether each of them is 0 or 1.
Note that the groups are believed to be cross-correlated, and that additional features can be extracted from the timestamps; for simplicity, we can assume there is only one feature we extract from the timestamps.
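To make that concrete, the single timestamp-derived feature could be something as simple as the hour of day. A purely illustrative sketch (the timestamp values here are made up; in the sample data below this role is played by the random 'extra' column):
import pandas as pd
ts = pd.Series(pd.date_range('2019-01-01', periods=5, freq='H'))  # hypothetical timestamps
hour_of_day = ts.dt.hour  # the single "extra" feature derived from the timestamp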
What I have so far:
Here is the data setup that you can reproduce on your own machine.
# libraries
import re
import numpy as np
import pandas as pd
data_length = 18 # how long our data series will be
shift_length = 3 # how long of a sequence do we want
df = (pd.DataFrame # create a sample dataframe
.from_records(np.random.randint(2, size=[data_length, 3]))
.rename(columns={0:'a', 1:'b', 2:'extra'}))
df.head() # check it out
# shift (assuming data is sorted already)
colrange = df.columns
shift_range = [_ for _ in range(-shift_length, shift_length+1) if _ != 0]
for c in colrange:
    for s in shift_range:
        if not (c == 'extra' and s > 0):
            charge = 'next' if s > 0 else 'last'  # the 'next' variables are what we want to predict
            formatted_s = '{0:02d}'.format(abs(s))
            new_var = '{var}_{charge}_{n}'.format(var=c, charge=charge, n=formatted_s)
            df[new_var] = df[c].shift(s)
# drop unnecessary variables and trim missings generated by the shift operation
df.dropna(axis=0, inplace=True)
df.drop(colrange, axis=1, inplace=True)
df = df.astype(int)
df.head() # check it out
# a_last_03 a_last_02 ... extra_last_02 extra_last_01
# 3 0 1 ... 0 1
# 4 1 0 ... 0 0
# 5 0 1 ... 1 0
# 6 0 0 ... 0 1
# 7 0 0 ... 1 0
# [5 rows x 15 columns]
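Since the head() preview above is truncated, for reference these are all 15 columns the shift step generates (the names are deterministic even though the values are random):
sorted(df.columns.tolist())
# ['a_last_01', 'a_last_02', 'a_last_03', 'a_next_01', 'a_next_02', 'a_next_03',
#  'b_last_01', 'b_last_02', 'b_last_03', 'b_next_01', 'b_next_02', 'b_next_03',
#  'extra_last_01', 'extra_last_02', 'extra_last_03']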
Before we get to the CRF part: I suspect that I cannot approach this problem from a multi-task learning point of view (predicting the patterns for both A and B via one model), and that I will therefore have to predict each of them separately.
Now to the CRF part. I have found some related examples (here is one), but they all tend to predict a single class value based on a prior sequence.
Here is my attempt at using a CRF here:
import pycrfsuite
crf_features = [] # a container for features
crf_labels = [] # a container for response
# let's focus on group A only for this one
current_response = [c for c in df.columns if c.startswith('a_next')]
# predictors are going to have to be nested otherwise I'll run into problems with dimensions
current_predictors = [c for c in df.columns if 'next' not in c]
current_predictors = set([re.sub(r'_\d+$', '', v) for v in current_predictors])

for index, row in df.iterrows():
    # not sure if it's an effective way to iterate over a DF...
    iter_features = []
    for p in current_predictors:
        pred_feature = []
        # note that 0/1 values have to be converted into booleans
        for k in range(shift_length):
            iter_pred_feature = p + '_{0:02d}'.format(k+1)
            pred_feature.append(p + "=" + str(bool(row[iter_pred_feature])))
        iter_features.append(pred_feature)
    iter_response = [row[current_response].apply(lambda z: str(bool(z))).tolist()]
    crf_labels.extend(iter_response)
    crf_features.append(iter_features)
trainer = pycrfsuite.Trainer(verbose=True)
for xseq, yseq in zip(crf_features, crf_labels):
    trainer.append(xseq, yseq)

trainer.set_params({
    'c1': 0.0,             # coefficient for L1 penalty
    'c2': 0.0,             # coefficient for L2 penalty
    'max_iterations': 10,  # stop earlier
    # include transitions that are possible, but not observed
    'feature.possible_transitions': True
})
trainer.train('testcrf.crfsuite')
tagger = pycrfsuite.Tagger()
tagger.open('testcrf.crfsuite')
tagger.tag(xseq)
# ['False', 'True', 'False']
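One way I try to sanity-check what this pycrfsuite model actually learned is to dump its transition and state-feature weights via the tagger opened above (a minimal sketch, just for inspection):
from collections import Counter
info = tagger.info()  # parsed model dump
print(Counter(info.transitions).most_common(5))     # strongest label-to-label transitions
print(Counter(info.state_features).most_common(5))  # strongest (feature, label) weights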
It looks like I did manage to make it work, but I'm not sure whether I'm approaching it correctly. I will formulate my questions in the Questions section, but first, here is an alternative approach using the keras_contrib package:
from keras import Sequential
from keras_contrib.layers import CRF
from keras_contrib.losses import crf_loss
# we are gonna have to revisit data prep stage again
# separate predictors and response
response_df_dict = {}
for g in ['a', 'b']:
    response_df_dict[g] = df[[c for c in df.columns if 'next' in c and g in c]]
# reformat for LSTM
# the response for every row is a matrix with depth of 2 (the number of groups) and width = shift_length
# the predictors are of the same dimensions except the depth is not 2 but the number of predictors that we have
response_array_list = []
col_prefix = set([re.sub(r'_\d+$', '', c) for c in df.columns if 'next' not in c])
for c in col_prefix:
    current_array = df[[z for z in df.columns if z.startswith(c)]].values
    response_array_list.append(current_array)
# reshape into samples (1), time stamps (2) and channels/variables (0)
response_array = np.array([response_df_dict['a'].values,response_df_dict['b'].values])
response_array = np.reshape(response_array, (response_array.shape[1], response_array.shape[2], response_array.shape[0]))
predictor_array = np.array(response_array_list)
predictor_array = np.reshape(predictor_array, (predictor_array.shape[1], predictor_array.shape[2], predictor_array.shape[0]))
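# quick sanity check on the layout described above: with data_length = 18 and
# shift_length = 3 there are 12 usable rows, so I expect
#   predictor_array: (12, 3, 3) -> (samples, time steps, predictor channels)
#   response_array:  (12, 3, 2) -> (samples, time steps, groups)
print(predictor_array.shape, response_array.shape)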
model = Sequential()
model.add(CRF(2, input_shape=(predictor_array.shape[1],predictor_array.shape[2])))
model.summary()
model.compile(loss=crf_loss, optimizer='adam', metrics=['accuracy'])
model.fit(predictor_array, response_array, epochs=10, batch_size=1)
model_preds = model.predict(predictor_array) # not gonna worry about train/test split here
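The raw predictions come back as one score per output unit per time step, so to eyeball them as hard 0/1 calls I collapse them with an argmax (whether that is a sensible decoding given the two-group target encoding is part of what I'm asking below):
print(model_preds.shape)                   # (n_samples, shift_length, 2)
hard_preds = model_preds.argmax(axis=-1)   # collapse the 2 output units to a single 0/1 per step
print(hard_preds[:3])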
Questions:
My main question is whether I have constructed both of the CRF models correctly. What concerns me is that (1) there is not much documentation on CRF models, (2) CRFs are mainly used for predicting a single label given a sequence, (3) the input features are nested, and (4) when used in a multi-task fashion, I'm not sure whether the approach is valid.
I also have some additional questions:
- Is a CRF appropriate for this problem?
- How do the two approaches (one based on pycrfsuite and one based on keras_contrib) differ, and what are their advantages/disadvantages?
- In a more general sense, what is the advantage of combining CRF and LSTM models into one (like the combination discussed here)?
Many thanks!