I am reproducing an architecture that uses ELMo encodings with a bidirectional LSTM, and the first two layers look like this:
input_layer = Input(shape=(1,), dtype="string", name="Input_layer")
embedding_layer = Lambda(ELMoEmbedding, output_shape=(1024, ), name="Elmo_Embedding")(input_layer)
However, I am not sure how to slot them in as a replacement for my existing Keras embedding layer:
Embedding(len(vocab), embedding_dimension, input_length=maximal_sentence_length)
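For context, the `ELMoEmbedding` callable passed to the `Lambda` layer is typically defined against a TF-Hub module. This is a sketch assuming the classic TF1-style ELMo module and its `"default"` signature, not necessarily your exact setup (it requires `tensorflow_hub` and a network download of the module):

```python
import tensorflow as tf
import tensorflow_hub as hub

# Sketch: TF1-style ELMo module from TF-Hub (assumption, not from the question).
elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)

def ELMoEmbedding(x):
    # The module's "default" output is one 1024-d vector per sentence,
    # which is why the Lambda layer declares output_shape=(1024,).
    return elmo(tf.squeeze(tf.cast(x, tf.string), axis=1),
                signature="default", as_dict=True)["default"]
```

Note that this signature consumes whole sentences as raw strings, which is exactly why the tokenized, id-mapped input below does not fit as-is.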
The input data is tokenized before training, so it is not the raw string type the ELMo implementation expects:
import nltk
import numpy as np
from tqdm import tqdm

def read_dataset(data_file, vocab_to_id, sent_len, debug=False):
    '''
    Read the training set or test set.
    :param data_file: path to the data file
    :param vocab_to_id: mapping from token to vocabulary id
    :param sent_len: the maximum sentence length used for padding/truncation
    :param debug: load only a small fraction of samples for debugging
    :return: model's input and labels
    Takes about 1min31s for the training set and 2min for the test set.
    '''
    labels, _ = get_label()
    unknown_id = len(vocab_to_id) - 1
    data_x, data_y = list(), list()
    cnt = 0
    for sample in tqdm(load_data(data_file)):
        # for debugging: stop after 100 samples
        cnt += 1
        if debug and cnt > 100:
            break
        summary = str.lower(sample.summary)
        tokens = nltk.word_tokenize(summary)
        token_ids = [vocab_to_id.get(t, unknown_id) for t in tokens]
        token_ids = pad_sentence(token_ids, sent_len)
        data_x.append(token_ids)
        occupations = sample.occupation
        # train: build a multi-hot label vector
        if occupations:
            y_vector = [1 if label in occupations else 0 for label in labels]
            data_y.append(y_vector)
        # test: no labels available
        else:
            data_y.append(0)
    return np.array(data_x), np.array(data_y)
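Since the ELMo input layer is declared as `Input(shape=(1,), dtype="string")`, the x side of the pipeline has to emit one raw string per sample instead of padded token ids, i.e. skip the `vocab_to_id` lookup and `pad_sentence` entirely. A minimal, dependency-free sketch of that change (only the x side is shown; `str.split()` stands in for `nltk.word_tokenize`, since ELMo handles its own character-level processing and does not need your fixed vocabulary):

```python
import numpy as np

def summaries_to_elmo_input(summaries):
    """Turn raw summaries into the (n, 1) string array that the
    Input(shape=(1,), dtype="string") layer expects: one lowercased,
    space-joined string per sample instead of padded token ids."""
    # str.split() is used here instead of nltk.word_tokenize to keep the
    # sketch self-contained; it also collapses repeated whitespace.
    texts = [" ".join(str.lower(s).split()) for s in summaries]
    return np.array(texts, dtype=object).reshape(-1, 1)
```

Feeding the result to `model.fit` then matches the ELMo layers, with no `len(vocab)`, `embedding_dimension`, or `maximal_sentence_length` needed, since those only existed to parameterize the `Embedding` layer being replaced.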