I am trying to build a neural network for a sequence-to-class use case. I have a dataframe with 7 columns:
index ID timestamp x1 x2 x3 date_maturity_encoded target_maturity
79 96273 2015-01-08 [] [] project1 29 06
80 96273 2015-01-08 [] [] project1 29 06
81 96273 2015-01-08 [] [] project1 29 06
82 96273 2015-01-19 [] [] project1 29 06
83 96273 2015-06-15 [] [] project1 39 06
84 96273 2016-02-28 [] [] project2 57 06
85 96274 2015-01-08 [] [] project2 29 08
86 96274 2015-01-08 [] [] project2 29 08
87 96274 2015-01-08 [] [] project2 29 08
88 96274 2015-02-26 [] [] project2 29 08
89 96274 2015-03-02 prg46 X1.80 [] project2 29 08
90 96274 2015-03-27 [] [] project2 35 08
91 96274 2015-04-09 [] [] project2 35 08
92 96274 2015-04-21 prg46 X1.80 [] project2 37 08
93 96274 2015-06-09 [] [] project2 39 08
94 96274 2015-06-23 [] [] project2 40 08
95 96274 2015-08-03 CW_38/15 [] project2 40 08
96 96274 2015-09-09 [] [] project2 52 08
97 96274 2015-09-21 [] [] project2 29 08
98 96274 2015-10-09 [] [] project2 29 08
99 96274 2016-03-01 CW_38/15 [] project2 57 08
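- The first 6 columns are the input and the 7th column is the output.
- ID and x3 are the attributes by which the dataset needs to be grouped and aggregated. There is always exactly one x3 per ID, and one ID can have i rows.
- Columns x1 and x2 are filled with strings. The timestamp column holds dates.
- target_maturity is the target value that needs to be predicted.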
First, I encode the projects (x3) and the target values with LabelEncoder:
### ENCODE PROJECTS WITH LABEL ENCODER
le = preprocessing.LabelEncoder()
le.fit(df.x3.unique())
df["x3_encoded"] = le.transform(df["x3"])
### ENCODE OUTPUT DATA
le.fit(df.target_maturity.unique())
df["target_maturity_encoded"] = le.transform(df["target_maturity"])
target = df.drop_duplicates(subset='ID', keep='first') # keep the first occurrence of the target value per ID
target = target['target_maturity_encoded']
Next, I convert the strings in x1/x2 into sequences of numbers:
tok = Tokenizer(char_level=True)
df['x1'] = [str(i) for i in df['x1']]
tok.fit_on_texts(df['x1'])
df['x1'] = tok.texts_to_sequences(df['x1'])
df['x2'] = [str(i) for i in df['x2']]
tok.fit_on_texts(df['x2'])
df['x2'] = tok.texts_to_sequences(df['x2'])
index ID timestamp x1 x2 x3_encoded date_maturity_encoded target_maturity_encoded
79 96273 2015-01-08 [1, 2] [2, 1] 1 29 3
80 96273 2015-01-08 [1, 2] [2, 1] 1 29 3
81 96273 2015-01-08 [1, 2] [2, 1] 1 29 3
82 96273 2015-01-19 [1, 2] [2, 1] 1 29 3
83 96273 2015-06-15 [1, 2] [2, 1] 1 39 3
84 96273 2016-02-28 [1, 2] [2, 1] 1 57 3
85 96274 2015-01-08 [1, 2] [2, 1] 2 29 5
86 96274 2015-01-08 [1, 2] [2, 1] 2 29 5
87 96274 2015-01-08 [1, 2] [2, 1] 2 29 5
88 96274 2015-02-26 [1, 2] [2, 1] 2 29 5
89 96274 2015-03-02 [3, 3, 24, 18, 40, 23, 21, 3, 25, 5, 14, 16, 4] [2, 1] 2 29 5
90 96274 2015-03-27 [1, 2] [2, 1] 2 35 5
91 96274 2015-04-09 [1, 2] [2, 1] 2 35 5
92 96274 2015-04-21 [3, 24, 18, 40, 23, 21, 3, 25, 5, 14, 16, 4] [2, 1] 2 37 5
93 96274 2015-06-09 [1, 2] [2, 1] 2 39 5
94 96274 2015-06-23 [1, 2] [2, 1] 2 40 5
95 96274 2015-08-03 [3, 3, 42, 13, 7, 15, 16, 39, 5, 22] [2, 1] 2 40 5
96 96274 2015-09-09 [1, 2] [2, 1] 2 52 5
97 96274 2015-09-21 [1, 2] [2, 1] 2 29 5
98 96274 2015-10-09 [1, 2] [2, 1] 2 29 5
99 96274 2016-03-01 [42, 13, 7, 15, 16, 39, 5, 22] [2, 1] 2 57 5
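To make the output above easier to read, here is a minimal pure-Python sketch of what character-level tokenization does. The helper `char_tokenize` is hypothetical and only imitates the idea (lowercase the text, then number characters by descending frequency, starting at 1); it is not the Keras implementation. It also shows why the empty cells, which were turned into the two-character string "[]" by str(), come out as two-element sequences such as [1, 2] rather than as empty sequences.

```python
from collections import Counter

def char_tokenize(texts):
    """Hypothetical helper: map each character to an integer index (1 = most frequent)."""
    counts = Counter(ch for text in texts for ch in text.lower())
    # most_common() sorts by descending count; ties keep first-seen order
    vocab = {ch: i for i, (ch, _) in enumerate(counts.most_common(), start=1)}
    sequences = [[vocab[ch] for ch in text.lower()] for text in texts]
    return sequences, vocab

# "[]" is what str([]) produces for an empty cell
sequences, vocab = char_tokenize(["[]", "[]", "prg46"])
print(sequences)  # [[1, 2], [1, 2], [3, 4, 5, 6, 7]]
```

If empty cells should carry no information, one option (an assumption, not something tested here) would be to map them to an empty string before tokenizing, so that they end up as pure padding later.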
Since I am trying to predict one target value per ID, and since the project is always the same for a given ID, I group the data as follows:
df = df[['ID', 'x3_encoded', 'timestamp', 'x1', 'x2', 'date_maturity_encoded']] # changing order and filtering out output data
data = df.groupby(['ID','x3_encoded']).agg(lambda x: x.tolist()) # aggregating dataframe as dataframe of lists
ID x3_encoded timestamp x1 x2 date_maturity_encoded
96273 1 [2015-01-08, 2015-01-08, 2015-01-08, 2015-01-1... [[1, 2], [1, 2], [1, 2], [1, 2], [1, 2], [1, 2]] [[2, 1], [2, 1], [2, 1], [2, 1], [2, 1], [2, 1]] [29, 29, 29, 29, 39, 57]
96274 2 [2015-01-08, 2015-01-08, 2015-01-08, 2015-02-2... [[1, 2], [1, 2], [1, 2], [1, 2], [3, 3, 24, 18... [[2, 1], [2, 1], [2, 1], [2, 1], [2, 1], [2, 1... [29, 29, 29, 29, 29, 35, 35, 37, 39, 40, 40, 5...
Define the number of output classes:
### ENCODE list_maturities
num_classes = len(np.unique(df[['vr_maturity', 'date_maturity']].values)) # (0-127) 128 classes in total
One-hot encode the output:
output_data = k.utils.to_categorical(target, num_classes = num_classes)
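For reference, a small NumPy sketch of what one-hot encoding produces. The function `one_hot` is a hypothetical stand-in written for illustration, and the class count of 8 is arbitrary; the two labels are the encoded targets of IDs 96273 and 96274 from the table above. The point is that the result has one row per ID, so it has to line up with one input row per ID after grouping.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Hypothetical stand-in: one row per label with a single 1.0 in it."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.size, num_classes), dtype="float32")
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

target_example = [3, 5]  # encoded targets of the two IDs shown above
output_example = one_hot(target_example, num_classes=8)  # 8 is illustrative only
print(output_example.shape)  # (2, 8)
```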
Create an array from the data to use as input:
data_array = data.to_numpy(dtype=object)
Train/test split:
input_shape = data_array[0].shape
x_train, x_test, y_train, y_test = train_test_split(data_array,
                                                    output_data,
                                                    test_size=0.1,
                                                    shuffle=True)
Fit the model:
model = Sequential()
model.add(Dense(units=8, activation='relu', input_shape=input_shape))
model.add(Dropout(0.2))
model.add(Dense(units=16, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(num_classes, activation='softmax'))
model.build(input_shape)
model.summary()
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
history = model.fit(x_train, y_train,
                    batch_size=10000,
                    epochs=5,
                    verbose=1,
                    validation_split=0.1)
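After all of this, I get the error shown below. I also tried converting every element of the input data into an array, but I cannot even do that for x_train without getting an error: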
x_tr = np.asarray([np.asarray(row, dtype=float) for row in x_train], dtype=float)
y_tr = np.asarray([np.asarray(row, dtype=float) for row in y_train], dtype=float)
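How can I fit sequences from a dataframe full of strings to a multi-class problem? Converting the sequences to a matrix with keras messes up the dataframe. After reading every post about this same keras error, I still have no idea how to solve it.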
2019-11-15 23:28:39.184411: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
Traceback (most recent call last):
File "C:\Users\reszi\Anaconda3\envs\deeplearning\lib\site-packages\IPython\core\interactiveshell.py", line 3296, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-105-49dec6ee8dff>", line 28, in <module>
validation_split=0.1)
File "C:\Users\reszi\Anaconda3\envs\deeplearning\lib\site-packages\keras\engine\training.py", line 1039, in fit
validation_steps=validation_steps)
File "C:\Users\reszi\Anaconda3\envs\deeplearning\lib\site-packages\keras\engine\training_arrays.py", line 199, in fit_loop
outs = f(ins_batch)
File "C:\Users\reszi\Anaconda3\envs\deeplearning\lib\site-packages\keras\backend\tensorflow_backend.py", line 2715, in __call__
return self._call(inputs)
File "C:\Users\reszi\Anaconda3\envs\deeplearning\lib\site-packages\keras\backend\tensorflow_backend.py", line 2655, in _call
dtype=tf.as_dtype(tensor.dtype).as_numpy_dtype))
File "C:\Users\reszi\Anaconda3\envs\deeplearning\lib\site-packages\numpy\core\_asarray.py", line 85, in asarray
return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence.
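The ValueError in the traceback above can be reproduced with NumPy alone. A minimal sketch, using two made-up rows of different lengths like those in the x1 column: NumPy cannot build a rectangular float array from ragged lists, whereas rows padded to a common length convert without problems.

```python
import numpy as np

# Two rows with sequences of different lengths, as in the x1 column
ragged = [[1, 2], [3, 3, 24, 18]]

try:
    np.asarray(ragged, dtype=float)
except ValueError as err:
    # NumPy refuses to build a rectangular float array from ragged rows
    error_message = str(err)
    print("ValueError:", error_message)

# Left-padding every row with zeros to a common length gives a regular 2-D array
max_len = max(len(row) for row in ragged)
padded = np.asarray([[0] * (max_len - len(row)) + row for row in ragged], dtype=float)
print(padded.shape)  # (2, 4)
```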
Following @DanielMöller's suggestion, this is what I did in my case.
Before tokenizing the sequences:
### - Convert the timestamps into numbers and normalize them
df['timestamp_int'] = pd.to_datetime(df['timestamp']).astype('int64')
df['timestamp_int'].head()
max_a = df.timestamp_int.max()
min_a = df.timestamp_int.min()
min_norm = 0
max_norm = 1
df['timestamp_NORMA'] = (df.timestamp_int - min_a) * (max_norm - min_norm) / (max_a - min_a) + min_norm
df['timestamp_NORMA'].head()
One-hot encoding:
df["date_maturity_one_hot"] = ""
num_classes = len(np.unique(list_maturities_encoded))
df["date_maturity_one_hot"] = k.utils.to_categorical(df["date_maturity_encoded"], num_classes=num_classes).tolist()
After tokenizing the sequences:
Zero-pad x1 and x2:
df['x1_pad'] = ""
df['x1_pad'] = pad_sequences(df['x1'], maxlen=max(df.x1.apply(len))).tolist()
df['x2_pad'] = ""
df['x2_pad'] = pad_sequences(df['x2'], maxlen=max(df.x2.apply(len))).tolist()
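As an illustration of the padding step, here is a pure-Python sketch of zero-padding at the front of each sequence, which is what pad_sequences does with its default settings. The helper `pad_pre` is hypothetical and written only for this example. Because the tokenizer indices start at 1, the value 0 stays free to mean "padding".

```python
def pad_pre(sequences, maxlen=None, value=0):
    """Hypothetical helper: left-pad (and left-truncate) each sequence to maxlen."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        # Keep only the last maxlen items, then fill the front with the pad value
        seq = list(seq)[-maxlen:] if maxlen else []
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

x1_example = [[1, 2], [3, 24, 18, 40]]
print(pad_pre(x1_example))  # [[0, 0, 1, 2], [3, 24, 18, 40]]
```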
Group by ID and x3_encoded:
agg_input_data = df.groupby(['ID', 'x3_encoded']).agg(lambda x: x.to_list()).reset_index()
Zero-pad the lists of lists:
cols = ['timestamp_NORMA', 'x1_pad', 'x2_pad', 'date_maturity_one_hot']
max_len = 118 # maximum rows an ID has in df
for i, r in agg_input_data.iterrows():
    for col in cols:
        max_char = max(input_data[col].apply(len)) ### number of characters in column
        N = max_len - len(agg_input_data.loc[i, col]) ### number of padding rows needed (118 - len(list of lists in column))
        agg_input_data.at[i, col] = [[0] * max_char] * N + agg_input_data.at[i, col]
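The padding of one group can be sketched in isolation as follows. The helper `pad_group` and the toy values are hypothetical; the sketch assumes every row of the group already has the same width. It builds each zero row separately, because an expression like [[0] * w] * n repeats one shared inner list, which can cause surprises if a row is modified later.

```python
def pad_group(rows, max_len, row_width):
    """Hypothetical helper: prepend all-zero rows until the group has max_len rows."""
    missing = max_len - len(rows)
    if missing < 0:
        raise ValueError("group has more rows than max_len")
    # Build each zero row separately; [[0] * w] * n would repeat one shared list
    zero_rows = [[0] * row_width for _ in range(missing)]
    return zero_rows + [list(row) for row in rows]

group = [[0, 1, 2], [3, 24, 18]]  # two rows of one ID, each already padded to width 3
padded_group = pad_group(group, max_len=4, row_width=3)
print(padded_group)  # [[0, 0, 0], [0, 0, 0], [0, 1, 2], [3, 24, 18]]
```

A column that holds one scalar per row, such as timestamp_NORMA, would presumably need scalar zeros rather than zero rows; that is an assumption about the intended layout.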
Handling multiple inputs:
max_timestamp_NORMA_length = max(agg_input_data.timestamp_NORMA.apply(len))
max_x1_pad_length = max(agg_input_data.x1_pad.apply(len))
max_x2_pad_length = max(agg_input_data.x2_pad.apply(len))
timeStampInput = Input((max_timestamp_NORMA_length,))
x1Input = Input((max_timestamp_NORMA_length, max_x1_pad_length))
x2Input = Input((max_timestamp_NORMA_length, max_x2_pad_length))
maturityInput = Input((max_timestamp_NORMA_length,))
Embeddings:
characterEmbedding = Embedding(298, 128) # max_chars & embedding_size
x1Embed = characterEmbedding(x1Input)
x2Embed = characterEmbedding(x2Input)
maturityEmbed = Embedding(127, 12)(maturityInput) # number_of_maturity_classes, embedding_size_2
In:
timeStampInput.shape
Out[57]:
TensorShape([Dimension(None), Dimension(118)])
In:
maturityEmbed.shape
Out[58]:
TensorShape([Dimension(None), Dimension(118), Dimension(12)])
Reduce the sequence length with an LSTM:
timeStampEncoded = LSTM(118)(timeStampInput)
Traceback (most recent call last):
File "C:\Users\reszi\Anaconda3\envs\deeplearning\lib\site-packages\IPython\core\interactiveshell.py", line 3296, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "", line 1, in <module>
timeStampEncoded = LSTM(118)(timeStampInput)
File "C:\Users\reszi\Anaconda3\envs\deeplearning\lib\site-packages\keras\layers\recurrent.py", line 532, in __call__
return super(RNN, self).__call__(inputs, **kwargs)
File "C:\Users\reszi\Anaconda3\envs\deeplearning\lib\site-packages\keras\engine\base_layer.py", line 414, in __call__
self.assert_input_compatibility(inputs)
File "C:\Users\reszi\Anaconda3\envs\deeplearning\lib\site-packages\keras\engine\base_layer.py", line 311, in assert_input_compatibility
str(K.ndim(x)))
ValueError: Input 0 is incompatible with layer lstm_1: expected ndim=3, found ndim=2
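The message says that the LSTM layer expects a 3-D input, laid out as (batch, timesteps, features), while timeStampInput has the 2-D shape (None, 118). A minimal NumPy-only sketch of the shape difference, assuming each of the 118 timesteps carries a single feature (the normalized timestamp); the batch size of 2 is arbitrary:

```python
import numpy as np

batch_size, timesteps = 2, 118  # 118 = maximum number of rows per ID
timestamps_2d = np.zeros((batch_size, timesteps), dtype="float32")

# Add a trailing feature axis: (batch, timesteps) -> (batch, timesteps, 1)
timestamps_3d = np.expand_dims(timestamps_2d, axis=-1)

print(timestamps_2d.ndim, timestamps_3d.shape)  # 2 (2, 118, 1)
```

On the Keras side this would presumably correspond to declaring the input as Input((118, 1)) and feeding arrays of the 3-D shape above. I have not verified that against the rest of the model.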