
Objective:

  • Identify the class label from a question entered by the user (as in a question-answering system).
  • The data is extracted from a large PDF file, and the page number has to be predicted from the user's input.
  • It is mainly used for policy documents: when a user has a question about a policy, the relevant page number should be shown.

Previous implementation: I applied Elasticsearch, but the accuracy was very low, because users can type anything, e.g. "I need" == "want".


Dataset information: each row of the dataset contains a text (or paragraph) and a label (the page number). The dataset is small; I only have 500 rows.
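
A minimal sketch of what loading such a dataset could look like (the file name, column names, and this get_data body are hypothetical; the actual DataGenerator module used in the code below is not shown in the question):

import csv

def get_data(path):
    texts, labels = [], []
    with open(path + 'dataset.csv', encoding='utf8') as f:
        for row in csv.DictReader(f):   # each row: a text/paragraph plus its page number
            texts.append(row['text'])
            labels.append(int(row['page']))
    # map each distinct page number to a contiguous class id for to_categorical
    labels_index = {page: idx for idx, page in enumerate(sorted(set(labels)))}
    labels = [labels_index[p] for p in labels]
    return texts, labels, labels_index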

Current implementation:

  • Applied word embeddings (GloVe) with an LSTM in Keras, using the TensorFlow backend
  • Applied dropout
  • Applied activity regularization
  • Applied L2 W_regularizer (from 0.1 down to 0.001)
  • Tried different nb_epoch values from 10 to 600
  • Changed EMBEDDING_DIM of the GloVe vectors from 100 to 300

Applied NLP preprocessing (a rough sketch follows the list below):

  • Converted to lowercase
  • Removed English stop words
  • Stemming
  • Removed numbers
  • Removed URLs and IP addresses
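
For reference, a minimal sketch of this preprocessing step (assuming NLTK with the stopwords corpus downloaded; the clean_text helper and the regex patterns are illustrative, not the exact code used):

import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOPWORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()
URL_RE = re.compile(r'https?://\S+|www\.\S+')
IP_RE = re.compile(r'\b\d{1,3}(?:\.\d{1,3}){3}\b')
NUM_RE = re.compile(r'\d+')

def clean_text(text):
    text = text.lower()           # convert to lowercase
    text = URL_RE.sub(' ', text)  # remove URLs
    text = IP_RE.sub(' ', text)   # remove IP addresses
    text = NUM_RE.sub(' ', text)  # remove remaining numbers
    tokens = [t for t in text.split() if t not in STOPWORDS]  # drop English stop words
    return ' '.join(STEMMER.stem(t) for t in tokens)          # stem what is left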

Result: accuracy on the test data (or validation data) is 23%, while accuracy on the training data is 91%.


Code:

import os
import time
from time import strftime

import numpy as np
from keras.callbacks import CSVLogger, ModelCheckpoint
from keras.layers import Dense, Input, LSTM, ActivityRegularization
from keras.layers import Embedding, Dropout,Bidirectional
from keras.models import Model
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.regularizers import l2
from keras.utils import to_categorical

import pickle
from DataGenerator import *

BASE_DIR = ''
GLOVE_DIR = 'D:/Dataset/glove.6B'  # BASE_DIR + '/glove.6B/'

MAX_SEQUENCE_LENGTH = 50
MAX_NB_WORDS = 20000
EMBEDDING_DIM = 300
VALIDATION_SPLIT = 0.2

# first, build index mapping words in the embeddings set
# to their embedding vector
np.random.seed(1337)  # for reproducibility

print('Indexing word vectors.')

t_start = time.time()

embeddings_index = {}

if os.path.exists('pickle/glove.pickle'):
    print('Pickle found..')
    with open('pickle/glove.pickle', 'rb') as handle:
        embeddings_index = pickle.load(handle)
else:
    print('Pickle not found...')
    f = open(os.path.join(GLOVE_DIR, 'glove.6B.300d.txt'), encoding='utf8')
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
    f.close()
    with open('pickle/glove.pickle', 'wb') as handle:
        pickle.dump(embeddings_index, handle, protocol=pickle.HIGHEST_PROTOCOL)

print('Found %s word vectors.' % len(embeddings_index))

# second, prepare text samples and their labels
print('Processing text dataset')

texts = []  # list of text samples
labels = []  # list of label ids
labels_index = {}  # dictionary mapping label name to numeric id

(texts, labels, labels_index) = get_data('D:/PolicyDocument/')

print('Found %s texts.' % len(texts))

# finally, vectorize the text samples into a 2D integer tensor
tokenizer = Tokenizer(nb_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)

labels = to_categorical(np.asarray(labels))
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

# split the data into a training set and a validation set
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
num_validation_samples = int(VALIDATION_SPLIT * data.shape[0])

x_train = data[:-num_validation_samples]
y_train = labels[:-num_validation_samples]
x_val = data[-num_validation_samples:]
y_val = labels[-num_validation_samples:]

# prepare embedding matrix
num_words = min(MAX_NB_WORDS, len(word_index))
embedding_matrix = np.zeros((num_words + 1, EMBEDDING_DIM))
print('Preparing embedding matrix. :', embedding_matrix.shape)
for word, i in word_index.items():
    if i > MAX_NB_WORDS:
        continue  # indices above the vocabulary cap would overflow the matrix
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
    # words not found in the embedding index stay all-zeros

# load pre-trained word embeddings into an Embedding layer
# note that we set trainable = False so as to keep the embeddings fixed
embedding_layer = Embedding(embedding_matrix.shape[0],
                            embedding_matrix.shape[1],
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            mask_zero=True,
                            trainable=False)

print('Training model.')

csv_file = "logs/training_log_" + strftime("%Y-%m-%d %H-%M", time.localtime()) + ".csv"
model_file = "models/Model_" + strftime("%Y-%m-%d %H-%M", time.localtime()) + ".mdl"
print("Model file:" + model_file)
csv_logger = CSVLogger(csv_file)

# build a stacked LSTM classifier
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)

rate_drop_lstm = 0.15 + np.random.rand() * 0.25
num_lstm = np.random.randint(175, 275)
rate_drop_dense = 0.15 + np.random.rand() * 0.25

x = LSTM(num_lstm, return_sequences=True, W_regularizer=l2(0.001))(embedded_sequences)
x = Dropout(0.5)(x)
x = LSTM(64)(x)
x = Dropout(0.25)(x)
x = ActivityRegularization(l1=0.01, l2=0.001)(x)
preds = Dense(len(labels_index), activation='softmax')(x)

model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['acc'])

model_checkpoint = ModelCheckpoint(model_file, monitor='val_loss', verbose=0, save_best_only=True,
                                   save_weights_only=False, mode='auto')

model.fit(x_train, y_train,
          batch_size=1,
          nb_epoch=600,
          validation_data=(x_val, y_val), callbacks=[csv_logger, model_checkpoint])

score = model.evaluate(x_val, y_val, verbose=0)
print('Test score:', score[0])
print('Test accuracy:', score[1])

t_end = time.time()
total = t_end - t_start
ret_str = "Time needed(s): " + str(total)
print(ret_str)

4 Answers


Dropout and BN (batch normalization) are very effective with feed-forward neural networks. However, they can cause problems for RNNs (there are many papers published on this topic).

The best way to make an RNN model generalize better is to increase the size of the dataset. In your case (an LSTM with about 200 cells), you would probably want around 100,000 or more labelled samples for training.

Answered 2017-05-17T10:38:54.867

Besides simply reducing parameters such as the embedding size and the number of units in some layers, there is also the possibility of adjusting the recurrent dropout in the LSTMs.

LSTMs seem to overfit quite easily (or so I have read).

Then, in the Keras documentation, you can see that every LSTM layer takes dropout and recurrent_dropout as parameters.

An example with arbitrary numbers:

x = LSTM(num_lstm, return_sequences=True, W_regularizer=l2(0.001), recurrent_dropout=0.4)(embedded_sequences)
x = Dropout(0.5)(x)
x = LSTM(64, dropout=0.5, recurrent_dropout=0.3)(x)

Other causes may be wrong or insufficient data:

  • Have you tried shuffling your test and validation data together and creating new training and validation sets? (A re-split sketch follows this list.)

  • How many sentences do you have in the training data? Are you training on a small subset? Use the whole set, or try data augmentation (creating new sentences and their classifications, although that can be very tricky with text).
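
A minimal sketch of such a re-shuffle and re-split (assuming the data and labels arrays built in your script; the split ratio and seed are arbitrary, and stratification only works if every page label has at least two samples):

from sklearn.model_selection import train_test_split

# Pool everything and draw a fresh stratified split so that every page
# label is represented in both the training and the validation set.
x_train, x_val, y_train, y_val = train_test_split(
    data, labels,
    test_size=0.2,
    random_state=42,
    stratify=labels.argmax(axis=1))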

Answered 2017-05-12T18:22:22.823

What you describe sounds very much like overfitting. Without more information about the data, the best suggestion is to try stronger regularization. @Daniel has already suggested the dropout parameters you are not using yet, dropout and recurrent_dropout. You can also try increasing the rate of the Dropout layers and using stronger regularization for the W_regularizer parameter.
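
Building on the snippet above and the variables from the question's script (num_lstm, embedded_sequences, labels_index), a sketch with stronger but still arbitrary settings; it assumes a Keras version that still accepts the legacy W_regularizer name (newer releases call it kernel_regularizer):

x = LSTM(num_lstm, return_sequences=True, W_regularizer=l2(0.01),
         dropout=0.5, recurrent_dropout=0.5)(embedded_sequences)   # stronger L2 + in-layer dropout
x = Dropout(0.6)(x)
x = LSTM(64, W_regularizer=l2(0.01), dropout=0.5, recurrent_dropout=0.5)(x)
x = Dropout(0.6)(x)
preds = Dense(len(labels_index), activation='softmax')(x)

Treat these numbers as starting points to tune, not as recommended values.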

More information would open up other options, for example whether you have tried Daniel's suggestion and how it turned out.

Answered 2017-05-15T15:59:49.883

Adversarial training methods (as a means of regularization) might be worth looking into: Adversarial Training Methods for Semi-Supervised Text Classification.

Answered 2017-09-08T18:36:32.117