keras - Keras pad_sequences with a string data type

Question

I have a list of sentences and I want to pad them. But when I use Keras pad_sequences like this:

from keras.preprocessing.sequence import pad_sequences
s = [["this", "is", "a", "book"], ["this", "is", "not"]]
g = pad_sequences(s, dtype='str', maxlen=10, value='_PAD_')

the result is:

array([['_', '_', '_', '_', '_', '_', 't', 'i', 'a', 'b'],
       ['_', '_', '_', '_', '_', '_', '_', 't', 'i', 'n']], dtype='<U1')

Why doesn't it work properly?

I want to use the result as input to ELMo embeddings, so I need the sentences as strings rather than integer encodings.

score 3 · Accepted Answer

Change dtype to object and it will do the job.

from keras.preprocessing.sequence import pad_sequences

s = [["this", "is", "a", "book"], ["this", "is", "not"]]
g = pad_sequences(s, dtype=object, maxlen=10, value='_PAD_')
print(g)

Output:

array([['_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', 'this',
        'is', 'a', 'book'],
       ['_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_',
        'this', 'is', 'not']], dtype=object)
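
As for why dtype='str' fails: the output in the question has dtype='<U1', which is what NumPy uses when it allocates an array from an unsized string dtype, so each cell holds a single character. The sketch below reproduces the effect with NumPy alone. It assumes pad_sequences allocates its output array with the requested dtype before writing tokens into it, which matches the observed output but is not confirmed by anything above.

```python
import numpy as np

# An unsized 'str' dtype gives one-character Unicode cells.
fixed = np.full((1, 3), "_PAD_", dtype="str")
print(fixed.dtype)

# Longer strings written into those cells are truncated silently.
fixed[0, -2:] = ["this", "is"]
print(fixed)

# dtype=object stores references to whole Python strings instead.
boxed = np.full((1, 3), "_PAD_", dtype=object)
boxed[0, -2:] = ["this", "is"]
print(boxed)
```

With dtype=object nothing has a fixed width, so both the padding value and the tokens survive intact.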

score -1 · Accepted Answer

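The text should first be converted to numeric values. Keras provides a Tokenizer with two methods, fit_on_texts and texts_to_sequences, for working with text data.

See the Keras documentation for reference.

Tokenizer: vectorizes a text corpus by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or a vector in which the coefficient for each token can be binary or based on word count.

fit_on_texts: builds the vocabulary index based on word frequency.

texts_to_sequences: converts each text in the input into a sequence of integers.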

from keras.preprocessing import text, sequence
s = ["this", "is", "a", "book", "of my choice"]
tokenizer = text.Tokenizer(num_words=100,lower=True)
tokenizer.fit_on_texts(s)
seq_token = tokenizer.texts_to_sequences(s)
g = sequence.pad_sequences(seq_token, maxlen=10)
g

Output:

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 2],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 3],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 4],
       [0, 0, 0, 0, 0, 0, 0, 5, 6, 7]], dtype=int32)
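
To make the two steps concrete, here is a rough pure-Python sketch of the idea behind fit_on_texts and texts_to_sequences. It is not the Keras implementation: the helper names fit_word_index and to_sequences are made up for illustration, and it ignores punctuation filtering, num_words, and out-of-vocabulary tokens.

```python
from collections import Counter


def fit_word_index(texts):
    """Map each word to an integer index, most frequent first, starting at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() keeps first-seen order among equally frequent words.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequences(texts, word_index):
    """Replace each known word by its index; unknown words are dropped."""
    return [
        [word_index[w] for w in text.lower().split() if w in word_index]
        for text in texts
    ]


s = ["this", "is", "a", "book", "of my choice"]
word_index = fit_word_index(s)
seqs = to_sequences(s, word_index)
print(word_index)
print(seqs)
```

Index 0 is left unused by the vocabulary, which is why pad_sequences can use 0 as its default padding value in the output above.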
