
I am trying to build a neural network for a sequence-to-class use case. I have a dataframe with 7 columns:

index    ID    timestamp                     x1                   x2                 x3           date_maturity_encoded    target_maturity

79      96273  2015-01-08                    []                   []                project1                 29          06
80      96273  2015-01-08                    []                   []                project1                 29          06
81      96273  2015-01-08                    []                   []                project1                 29          06
82      96273  2015-01-19                    []                   []                project1                 29          06
83      96273  2015-06-15                    []                   []                project1                 39          06
84      96273  2016-02-28                    []                   []                project2                 57          06
85      96274  2015-01-08                    []                   []                project2                 29          08
86      96274  2015-01-08                    []                   []                project2                 29          08
87      96274  2015-01-08                    []                   []                project2                 29          08
88      96274  2015-02-26                    []                   []                project2                 29          08
89      96274  2015-03-02           prg46 X1.80                   []                project2                 29          08
90      96274  2015-03-27                    []                   []                project2                 35          08
91      96274  2015-04-09                    []                   []                project2                 35          08
92      96274  2015-04-21           prg46 X1.80                   []                project2                 37          08
93      96274  2015-06-09                    []                   []                project2                 39          08
94      96274  2015-06-23                    []                   []                project2                 40          08
95      96274  2015-08-03              CW_38/15                   []                project2                 40          08
96      96274  2015-09-09                    []                   []                project2                 52          08
97      96274  2015-09-21                    []                   []                project2                 29          08
98      96274  2015-10-09                    []                   []                project2                 29          08
99      96274  2016-03-01              CW_38/15                   []                project2                 57          08
  • The first 6 columns are the inputs, the 7th column is the output.
  • ID and x3 are the attributes the dataset needs to be grouped and aggregated by.
  • There is always exactly one x3 per ID. One ID can have i rows.
  • x1 and x2 are filled with strings. The timestamp column holds dates.

target_maturity is the target value that needs to be predicted.

First, I encode x3 and the target values with a LabelEncoder:

### ENCODE PROJECTS WITH LABEL ENCODER
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(df.x3.unique())
df["x3_encoded"] = le.transform(df["x3"])


### ENCODE OUTPUT DATA
le.fit(df.target_maturity.unique())
df["target_maturity_encoded"] = le.transform(df["target_maturity"])
target = df.drop_duplicates(subset='ID', keep='first')  # keep the first occurrence of the target value per ID
target = target['target_maturity_encoded']

Next, I turn the strings in x1/x2 into sequences of numbers:

from keras.preprocessing.text import Tokenizer

tok = Tokenizer(char_level=True)
df['x1'] = [str(i) for i in df['x1']]
tok.fit_on_texts(df['x1'])
df['x1'] = tok.texts_to_sequences(df['x1'])


df['x2'] = [str(i) for i in df['x2']]
tok.fit_on_texts(df['x2'])
df['x2'] = tok.texts_to_sequences(df['x2'])
index    ID    timestamp                        x1                                        x2                 x3_encoded  date_maturity_encoded    target_maturity_encoded

79      96273  2015-01-08                                           [1, 2]               [2, 1]                   1                     29          3
80      96273  2015-01-08                                           [1, 2]               [2, 1]                   1                     29          3
81      96273  2015-01-08                                           [1, 2]               [2, 1]                   1                     29          3
82      96273  2015-01-19                                           [1, 2]               [2, 1]                   1                     29          3
83      96273  2015-06-15                                           [1, 2]               [2, 1]                   1                     39          3
84      96273  2016-02-28                                           [1, 2]               [2, 1]                   1                     57          3
85      96274  2015-01-08                                           [1, 2]               [2, 1]                   2                     29          5
86      96274  2015-01-08                                           [1, 2]               [2, 1]                   2                     29          5
87      96274  2015-01-08                                           [1, 2]               [2, 1]                   2                     29          5
88      96274  2015-02-26                                           [1, 2]               [2, 1]                   2                     29          5
89      96274  2015-03-02  [3, 3, 24, 18, 40, 23, 21, 3, 25, 5, 14, 16, 4]               [2, 1]                   2                     29          5
90      96274  2015-03-27                                           [1, 2]               [2, 1]                   2                     35          5
91      96274  2015-04-09                                           [1, 2]               [2, 1]                   2                     35          5
92      96274  2015-04-21     [3, 24, 18, 40, 23, 21, 3, 25, 5, 14, 16, 4]               [2, 1]                   2                     37          5
93      96274  2015-06-09                                           [1, 2]               [2, 1]                   2                     39          5
94      96274  2015-06-23                                           [1, 2]               [2, 1]                   2                     40          5
95      96274  2015-08-03             [3, 3, 42, 13, 7, 15, 16, 39, 5, 22]               [2, 1]                   2                     40          5
96      96274  2015-09-09                                           [1, 2]               [2, 1]                   2                     52          5
97      96274  2015-09-21                                           [1, 2]               [2, 1]                   2                     29          5
98      96274  2015-10-09                                           [1, 2]               [2, 1]                   2                     29          5
99      96274  2016-03-01                   [42, 13, 7, 15, 16, 39, 5, 22]               [2, 1]                   2                     57          5

Since I am trying to predict one target value per ID, and since the project is the same for a given ID, I group the data as follows:

df = df[['ID', 'x3_encoded', 'timestamp', 'x1', 'x2',  'date_maturity_encoded']] # changing order and filtering out output data
data = df.groupby(['ID','x3_encoded']).agg(lambda x: x.tolist()) # aggregating dataframe as dataframe of lists
ID      x3_encoded       timestamp                                              x1                                          x2                                                        date_maturity_encoded
96273    1    [2015-01-08, 2015-01-08, 2015-01-08, 2015-01-1...    [[1, 2], [1, 2], [1, 2], [1, 2], [1, 2], [1, 2]]   [[2, 1], [2, 1], [2, 1], [2, 1], [2, 1], [2, 1]]   [29, 29, 29, 29, 39, 57]  
96274    2     [2015-01-08, 2015-01-08, 2015-01-08, 2015-02-2...   [[1, 2], [1, 2], [1, 2], [1, 2], [3, 3, 24, 18...  [[2, 1], [2, 1], [2, 1], [2, 1], [2, 1], [2, 1...  [29, 29, 29, 29, 29, 35, 35, 37, 39, 40, 40, 5...

Define the number of output classes:

### ENCODE list_maturities
import numpy as np

num_classes = len(np.unique(df[['vr_maturity', 'date_maturity']].values))  # (0-127) 128 classes in total

One-hot encode the output:

import keras as k

output_data = k.utils.to_categorical(target, num_classes=num_classes)

Create an array from the data that serves as input:

data_array = data.to_numpy(dtype=object) 

Train-test split:

from sklearn.model_selection import train_test_split

input_shape = data_array[0].shape
x_train, x_test, y_train, y_test = train_test_split(data_array,
                                                    output_data,
                                                    test_size=0.1,
                                                    shuffle=True)

Fit the model:

from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(units=8, activation='relu', input_shape=input_shape))
model.add(Dropout(0.2))
model.add(Dense(units=16, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(num_classes, activation='softmax'))
model.build(input_shape)
model.summary()

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

history = model.fit(x_train, y_train,
                    batch_size=10000,
                    epochs=5,
                    verbose=1,
                    validation_split=0.1)

After all this, I get the error below. I also tried converting each element of the input data into arrays, but I cannot even do that for x_train without getting an error:

x_tr = np.asarray([np.asarray(row, dtype=float) for row in x_train], dtype=float)
y_tr = np.asarray([np.asarray(row, dtype=float) for row in y_train], dtype=float)

How can I fit sequences from a dataframe full of strings to a multi-class problem? Converting the sequences into matrices with keras messes up the dataframe. After reading every post about the same error with keras, I am completely at a loss as to how to solve this.

2019-11-15 23:28:39.184411: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
Traceback (most recent call last):
  File "C:\Users\reszi\Anaconda3\envs\deeplearning\lib\site-packages\IPython\core\interactiveshell.py", line 3296, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-105-49dec6ee8dff>", line 28, in <module>
    validation_split=0.1)
  File "C:\Users\reszi\Anaconda3\envs\deeplearning\lib\site-packages\keras\engine\training.py", line 1039, in fit
    validation_steps=validation_steps)
  File "C:\Users\reszi\Anaconda3\envs\deeplearning\lib\site-packages\keras\engine\training_arrays.py", line 199, in fit_loop
    outs = f(ins_batch)
  File "C:\Users\reszi\Anaconda3\envs\deeplearning\lib\site-packages\keras\backend\tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "C:\Users\reszi\Anaconda3\envs\deeplearning\lib\site-packages\keras\backend\tensorflow_backend.py", line 2655, in _call
    dtype=tf.as_dtype(tensor.dtype).as_numpy_dtype))
  File "C:\Users\reszi\Anaconda3\envs\deeplearning\lib\site-packages\numpy\core\_asarray.py", line 85, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence.

Following @DanielMöller's advice, in my case:

Before tokenizing the sequences:

### - Convert the timestamps into numbers and normalize them
df['timestamp_int'] = pd.to_datetime(df['timestamp']).astype('int64')
df['timestamp_int'].head()
max_a = df.timestamp_int.max()
min_a = df.timestamp_int.min()
min_norm = 0
max_norm = 1
df['timestamp_NORMA'] = (df.timestamp_int - min_a) * (max_norm - min_norm) / (max_a - min_a) + min_norm
df['timestamp_NORMA'].head()

One-hot encoding:

df["date_maturity_one_hot"] = ""
num_classes = len(np.unique(list_maturities_encoded))
df["date_maturity_one_hot"] =
k.utils.to_categorical(df["date_maturity_encoded"], num_classes=num_classes).tolist()

After tokenizing the sequences:

Zero-pad x1 and x2:

from keras.preprocessing.sequence import pad_sequences

df['x1_pad'] = ""
df['x1_pad'] = pad_sequences(df['x1'], maxlen=max(df.x1.apply(len))).tolist()

df['x2_pad'] = ""
df['x2_pad'] = pad_sequences(df['x2'], maxlen=max(df.x2.apply(len))).tolist()

Group by ID and x3_encoded:

agg_input_data = df.groupby(['ID', 'x3_encoded']).agg(lambda x: x.tolist()).reset_index()

Zero-pad the lists of lists:

cols = ['timestamp_NORMA', 'x1_pad', 'x2_pad', 'date_maturity_one_hot']
max_len = 118  # maximum rows an ID has in df

for i, r in agg_input_data.iterrows():
    for col in cols:
        max_char = max(input_data[col].apply(len))  ### number of characters in column
        N = max_len - len(agg_input_data.loc[i, col])  ### padding difference (118 - len(list of lists in column))
        agg_input_data.at[i, col] = [[0] * max_char] * N + agg_input_data.at[i, col]

Multi-input handling:

from keras.layers import Input, Embedding, LSTM

max_timestamp_NORMA_length = max(agg_input_data.timestamp_NORMA.apply(len))
max_x1_pad_length = max(agg_input_data.x1_pad.apply(len))
max_x2_pad_length = max(agg_input_data.x2_pad.apply(len))

timeStampInput = Input((max_timestamp_NORMA_length,))
x1Input = Input((max_timestamp_NORMA_length, max_x1_pad_length))
x2Input = Input((max_timestamp_NORMA_length, max_x2_pad_length))
maturityInput = Input((max_timestamp_NORMA_length,))

Embeddings:

characterEmbedding = Embedding(298, 128)  # max_chars & embedding_size
x1Embed = characterEmbedding(x1Input)
x2Embed = characterEmbedding(x2Input)

maturityEmbed = Embedding(127, 12)(maturityInput)  # number_of_maturity_classes, embedding_size_2

In:  timeStampInput.shape
Out[57]: TensorShape([Dimension(None), Dimension(118)])

In:  maturityEmbed.shape
Out[58]: TensorShape([Dimension(None), Dimension(118), Dimension(12)])

Reduce the sequence length with an LSTM:

timeStampEncoded = LSTM(118)(timeStampInput)

Traceback (most recent call last):
  File "C:\Users\reszi\Anaconda3\envs\deeplearning\lib\site-packages\IPython\core\interactiveshell.py", line 3296, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input>", line 1, in <module>
    timeStampEncoded = LSTM(118)(timeStampInput)
  File "C:\Users\reszi\Anaconda3\envs\deeplearning\lib\site-packages\keras\layers\recurrent.py", line 532, in __call__
    return super(RNN, self).__call__(inputs, **kwargs)
  File "C:\Users\reszi\Anaconda3\envs\deeplearning\lib\site-packages\keras\engine\base_layer.py", line 414, in __call__
    self.assert_input_compatibility(inputs)
  File "C:\Users\reszi\Anaconda3\envs\deeplearning\lib\site-packages\keras\engine\base_layer.py", line 311, in assert_input_compatibility
    str(K.ndim(x)))

ValueError: Input 0 is incompatible with layer lstm_1: expected ndim=3, found ndim=2
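
For context on the ndim mismatch: the LSTM expects 3-D input of shape (batch, timesteps, features), while timeStampInput above is 2-D (batch, 118). A minimal sketch of one way to add a feature axis before the LSTM (the Reshape approach here is an assumption of mine, not something from the original post):

from keras.layers import Reshape, LSTM

# (batch, 118) -> (batch, 118, 1): give each timestep a single feature
timeStampReshaped = Reshape((118, 1))(timeStampInput)
timeStampEncoded = LSTM(118)(timeStampReshaped)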


1 Answer


This is a case of having lists as elements of an array. Numpy arrays for keras must have all values of the same type and a fixed length.
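
For illustration, a toy reproduction of exactly this situation (not the original data): rows of different lengths cannot be packed into a regular float array.

import numpy as np

# Ragged rows: the second row is longer than the first, so numpy cannot build
# a rectangular float array and raises
# "ValueError: setting an array element with a sequence."
np.asarray([[1, 2], [1, 2, 3]], dtype=float)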

The best thing you can do right now is to separate each column into a different X array.

Now, you will need a lot of processing on this data before it can go into a neural network. You should probably convert the dates into numbers, convert the classes into one-hot encodings and, worst of all, decide what to do with the lists in x1 and the lists in x2.

I can see that you will need to:

Before aggregating:

  • Convert the timestamps into numbers and normalize them
  • Pad the x1 and x2 sequences with zeros so that all sequences have the same length
  • Read about pad_sequences
  • Notice that you must treat them as lists, not as one huge string

After aggregating:

  • Pad the timestamp sequences
  • Pad the date-maturity sequences
  • Pad the x1 and x2 sequences again (since they are lists of lists: you did it for the inner lists, now you do it for the outer lists, and you need to pad with numpy arrays of the same size as the inner lists; a sketch follows this list)
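
A minimal sketch of that two-level padding, assuming a dataframe df with the tokenized list columns x1/x2 from the question (the variable names here are illustrative, not from the original code):

import numpy as np
from keras.preprocessing.sequence import pad_sequences

# Inner padding (before aggregating): make every x1 list the same length.
max_x1_length = max(df['x1'].apply(len))
df['x1_pad'] = pad_sequences(df['x1'], maxlen=max_x1_length).tolist()

# Aggregate per ID, turning each column into a list of that ID's rows.
grouped = df.groupby(['ID', 'x3_encoded']).agg(lambda s: s.tolist()).reset_index()

# Outer padding (after aggregating): prepend all-zero rows of the inner length
# so every ID ends up with the same number of timesteps.
max_time_length = max(grouped['x1_pad'].apply(len))
x1Array = np.array([[[0] * max_x1_length] * (max_time_length - len(rows)) + rows
                    for rows in grouped['x1_pad']], dtype='float32')
# x1Array.shape == (number_of_IDs, max_time_length, max_x1_length)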

Finally, your model will need multiple inputs to process these sequences:

timeStampInput = Input((max_time_length,))
x1Input = Input((max_time_length, max_x1_length))
x2Input = Input((max_time_length, max_x2_length))
maturityInput = Input((max_time_length,))

You will need to pass the encoded inputs through embeddings so that they carry meaningful values into the model. Ideally, you would have encoded x1 and x2 together, since they are both sequences of characters; that way you would need only one embedding instead of two.

characterEmbedding = Embedding(max_chars, embedding_size)
x1Embed = characterEmbedding(x1Input)
x2Embed = characterEmbedding(x2Input)

maturityEmbed = Embedding(number_of_maturity_classes, embedding_size_2)(maturityInput)
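
A minimal sketch of the "encode x1 and x2 together" suggestion, assuming the raw string columns x1/x2 before any tokenization (the shared_tok name and the vocabulary-size line are illustrative assumptions):

from keras.preprocessing.text import Tokenizer

shared_tok = Tokenizer(char_level=True)
shared_tok.fit_on_texts(df['x1'].astype(str).tolist() + df['x2'].astype(str).tolist())
df['x1'] = shared_tok.texts_to_sequences(df['x1'].astype(str))
df['x2'] = shared_tok.texts_to_sequences(df['x2'].astype(str))

# One shared vocabulary, so one Embedding layer covers both inputs.
max_chars = len(shared_tok.word_index) + 1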

Now you will have to reduce the length of the sequences. LSTM layers should do this well. (You could also try Conv1D with global pooling; a sketch of that alternative follows the LSTM example below.)

For the maturities, which at this point should have shape (batch, max_time_length, embedding_size_2), this is just a regular LSTM. The same goes for the timestamps:

timeStampEncoded = LSTM(units_1)(timeStampInput)
maturityEncoded = LSTM(units_2)(maturityEmbed)
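
For reference, the Conv1D-plus-global-pooling alternative mentioned above could look roughly like this (the filter count and kernel size are illustrative assumptions):

from keras.layers import Conv1D, GlobalMaxPooling1D

# Same role as the LSTM: collapse the time dimension into a fixed-size vector.
maturityEncoded = Conv1D(filters=32, kernel_size=3, padding='same', activation='relu')(maturityEmbed)
maturityEncoded = GlobalMaxPooling1D()(maturityEncoded)  # shape: (batch, 32)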

Now, for x1 and x2 you need to do this at two levels, because they are sequences of sequences:

#inner dimension
x1Encoded = TimeDistributed(LSTM(units_in))(x1Embed)
x2Encoded = TimeDistributed(LSTM(units_in))(x2Embed)

#outer dimension
x1Encoded = LSTM(units_out)(x1Encoded)
x2Encoded = LSTM(units_out)(x2Encoded)

Finally, you can concatenate everything:

allInputs = Concatenate()([timeStampEncoded, maturityEncoded, x1Encoded, x2Encoded])

Now you can use a regular 2D model:

out = Dense(units=8, activation='relu')(allInputs)
out = Dropout(0.2)(out)
out = Dense(units=16, activation='relu')(out)
out = Dropout(0.2)(out)
out = Dense(num_classes, activation='softmax')(out)

model = Model([timeStampInput, x1Input, x2Input, maturityInput], out)

You will need to train the model with the four inputs:

model.fit([timeStampArray, x1Array, x2Array, maturityArray], labels)

Note that the shapes of the data should be like the following (a sketch of building these arrays appears after the list):

  • timeStampArray.shape = (data_frame_length, max_time_length)
  • x1Array.shape = (data_frame_length, max_time_length, max_x1_length)
  • x2Array.shape = (data_frame_length, max_time_length, max_x2_length)
  • maturityArray.shape = (data_frame_length, max_time_length)
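
A minimal sketch of assembling these four arrays from the aggregated dataframe in the question and training the model. Assumptions of mine: the integer-encoded maturities were aggregated and zero-padded per ID the same way (here in a column called date_maturity_encoded, since the Embedding input expects integer indices rather than one-hot lists), output_data lines up row-for-row with agg_input_data, and the float32 dtype and batch size are illustrative.

import numpy as np

timeStampArray = np.asarray(agg_input_data['timestamp_NORMA'].tolist(), dtype='float32')        # (n, max_time_length)
maturityArray  = np.asarray(agg_input_data['date_maturity_encoded'].tolist(), dtype='float32')  # (n, max_time_length)
x1Array = np.asarray(agg_input_data['x1_pad'].tolist(), dtype='float32')                        # (n, max_time_length, max_x1_length)
x2Array = np.asarray(agg_input_data['x2_pad'].tolist(), dtype='float32')                        # (n, max_time_length, max_x2_length)

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit([timeStampArray, x1Array, x2Array, maturityArray], output_data,
          epochs=5, batch_size=32, validation_split=0.1)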

I'm afraid it doesn't get any simpler than that. You will have to search for questions about preprocessing sequences for LSTMs to get a better idea of what to do.

Answered 2019-11-18T18:41:23.700