python - How to fix class imbalance in conversational (text) time series data?

Question

I have a dataset that looks like this:

df.head(5)


 data                                                     labels
0  [0.0009808844009380855, 0.0008974465127279559]             1
1  [0.0007158940267629654, 0.0008202958833774329]             3
2  [0.00040971929722210984, 0.000393972522972382]             3
3  [7.916243163372941e-05, 7.401835468434177e243]             3
4  [8.447556379936086e-05, 8.600626393842705e-05]             3

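The "data" column is my X and "labels" is my y. The df has 34,890 rows. Each row contains 2 floats. The data represents a series of sequential texts, and each observation is a representation of one sentence. There are 5 classes.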

I'm training on it with this LSTM code:

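import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras import Sequential, layers
from tensorflow.keras.layers import Dense

# Stack the per-row lists into a 2-D float array, shape (34890, 2)
data = np.array(df['data'].tolist())
labels = pd.get_dummies(df['labels']).values.astype('float32')  # one-hot, shape (34890, 5)

X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.10, random_state=42)

# Reshape to (samples, timesteps, features) with a single timestep
X_train = X_train.reshape((X_train.shape[0], 1, X_train.shape[1]))  # shape = (31401, 1, 2)
X_test = X_test.reshape((X_test.shape[0], 1, X_test.shape[1]))  # shape = (3489, 1, 2)
### y_train shape =  (31401, 5)
### y_test shape =  (3489, 5)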

### Bi_LSTM
Bi_LSTM = Sequential()
Bi_LSTM.add(layers.Bidirectional(layers.LSTM(32)))
Bi_LSTM.add(layers.Dropout(.5))
# Bi_LSTM.add(layers.Flatten())
Bi_LSTM.add(Dense(5, activation='softmax'))  # one output unit per class (5 classes)

def compile_and_fit(model):
    # Compile the model, then train it; returns the Keras History object
    model.compile(optimizer='rmsprop',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])

    history = model.fit(X_train,
                        y_train,
                        epochs=30,
                        batch_size=32,
                        validation_data=(X_test, y_test))

    return history

LSTM_history = compile_and_fit(Bi_LSTM)

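The model trains, but the validation accuracy is stuck at 53% in every epoch. I assume this is because of my class imbalance (1 class takes up ~53% of the data, and the other 4 are spread fairly evenly across the remaining 47%).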

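How can I balance my data? I know the typical over/undersampling techniques for non-time-series data, but I can't over/undersample here because that would break the sequential, time-series nature of the data. Any suggestions?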

Edit

I'm trying to address this with the class_weight parameter in Keras. I pass this dictionary to the class_weight parameter:

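# Keys are one-hot column indices (0-4); assuming the labels are 1-5, key k maps to label k + 1
class_weights = {
    0: 1/len(df[df['labels'] == 1]),
    1: 1/len(df[df['labels'] == 2]),
    2: 1/len(df[df['labels'] == 3]),
    3: 1/len(df[df['labels'] == 4]),
    4: 1/len(df[df['labels'] == 5]),
}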

I based this on the following suggestion:

https://stats.stackexchange.com/questions/342170/how-to-train-an-lstm-when-the-sequence-has-imbalanced-classes

However, the acc/loss is now really bad. I get ~30% accuracy with a dense network, so I expected the LSTM to improve on that. See the acc/loss curves below:

score 1 · Accepted Answer

Keras/TensorFlow lets you pass class_weight or sample_weight to the model.fit method.

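class_weight affects the relative weight of each class in the calculation of the objective function. sample_weight, as the name suggests, allows further control over the relative weight of individual samples, including samples that belong to the same class.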
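To make the effect on the objective function concrete, here is a minimal pure-Python sketch of a class-weighted cross-entropy. It is only an illustration of the idea: the function name `weighted_cross_entropy` and the toy data are made up for this example, and the averaging used here is an assumption, not necessarily the exact reduction Keras applies internally.

```python
import math

def weighted_cross_entropy(y_true, y_pred, class_weight):
    """Mean categorical cross-entropy where each sample's loss is scaled
    by the weight of its true class (illustrative, not Keras internals)."""
    total = 0.0
    for true_class, probs in zip(y_true, y_pred):
        sample_loss = -math.log(probs[true_class])  # loss for one sample
        total += class_weight[true_class] * sample_loss
    return total / len(y_true)

# Toy data: class 0 is the majority, class 1 the minority.
y_true = [0, 0, 0, 1]
# A model that always favors the majority class.
y_pred = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1], [0.9, 0.1]]

unweighted = weighted_cross_entropy(y_true, y_pred, {0: 1.0, 1: 1.0})
weighted = weighted_cross_entropy(y_true, y_pred, {0: 1.0, 1: 3.0})

print(f"unweighted loss: {unweighted:.4f}")
print(f"weighted loss:   {weighted:.4f}")
```

With the minority class weighted up, the mistake on the single minority sample contributes more to the loss, so a model that always predicts the majority class is penalized more heavily.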

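class_weight accepts a dictionary in which you specify a weight for each class, while sample_weight takes a one-dimensional array of length len(y_train) in which you assign a specific weight to each sample.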
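As a rough sketch of the two input formats, the snippet below builds a class-weight dictionary and the matching per-sample weight list from integer class indices. The helper names `balanced_class_weights` and `per_sample_weights` and the toy labels are hypothetical. The weighting formula, n_samples / (n_classes * count), is one common "balanced" heuristic and is an assumption here, not something the answer prescribes. Unlike raw 1/count weights, it keeps the average sample weight at 1, so the overall scale of the loss is unchanged.

```python
from collections import Counter

def balanced_class_weights(class_indices):
    """Map each class index to n_samples / (n_classes * class_count)."""
    counts = Counter(class_indices)
    n_samples = len(class_indices)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in sorted(counts.items())}

def per_sample_weights(class_indices, class_weight):
    """Expand a class-weight dict into one weight per sample."""
    return [class_weight[cls] for cls in class_indices]

# Hypothetical integer class indices (not the asker's data).
y_idx = [0, 0, 0, 0, 1, 1, 2, 2]

class_weight = balanced_class_weights(y_idx)
sample_weight = per_sample_weights(y_idx, class_weight)

print(class_weight)
print(sample_weight)
```

With one-hot labels such as y_train in the question, the integer indices could be recovered with y_train.argmax(axis=1). The dictionary would then be passed as model.fit(..., class_weight=class_weight), or the per-sample list, converted to a NumPy array, as model.fit(..., sample_weight=...).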
