Some background:

I have structured my data as a TF-IDF matrix of shape (15637, 31635), and this matrix is the input to the LSTM layer. The longest sentence in my corpus is 305 words. Each TF-IDF vector has length 31635 because that is the size of the total vocabulary of the corpus; in other words, each of the 15637 sentences is represented by a TF-IDF vector of shape (31635,). I am using TF-IDF rather than a pretrained embedding layer.
No_of_sentences = 15637
BATCH_SIZE = 64
steps_per_epoch = 15637 // 64 = 244 (remainder dropped)
vocab_inp_size = 31635  # tokens created by the Keras tokenizer; these are the distinct words in the input corpus
vocab_tar_size = 4  # one-hot encoding of the target value
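For reference, here is a minimal sketch of how such a TF-IDF matrix and integer targets could be produced with the Keras Tokenizer; the texts and labels variables below are placeholders, not my actual preprocessing code:

import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer

texts = ["first example sentence", "another short sentence", "one more"]  # placeholder corpus
labels = [1, 2, 3]  # placeholder integer class ids

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)  # word_index now holds the distinct words of the corpus

# mode='tfidf' gives one TF-IDF vector per sentence, of length (vocab size + 1),
# analogous to the (15637, 31635) matrix described above
input_tfidfVector = tokenizer.texts_to_matrix(texts, mode="tfidf")

# targets as an int32 column vector, matching the (batch, 1) target shape printed below
target_tensor_train = np.array(labels, dtype="int32").reshape(-1, 1)

print(input_tfidfVector.shape, target_tensor_train.shape)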
The code below first creates tensor slices, then batches them; finally, enumerating the batched dataset yields, for each batch, a tuple of the form (batch, (input_tensor, target_tensor)).
import tensorflow as tf

BUFFER_SIZE = No_of_sentences  # assumed shuffle buffer size; not shown in the original snippet

dataset = tf.data.Dataset.from_tensor_slices((input_tfidfVector, target_tensor_train)).shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)  # this is where batching happens

for batch in enumerate(dataset.take(steps_per_epoch)):
    print(batch)  # prints the tuple: the current batch index (batch 0) plus the input and target tensors
(0, (<tf.Tensor: shape=(64, 31635), dtype=float64, numpy=
array([[0. , 1.74502835, 0. , ..., 0. , 0. ,
0. ],
[0. , 0. , 0. , ..., 0. , 0. ,
0. ],
[0. , 0. , 0. , ..., 0. , 0. ,
0. ],
...,
[0. , 0. , 0. , ..., 0. , 0. ,
0. ],
[0. , 1.74502835, 0. , ..., 0. , 0. ,
0. ],
[0. , 1.74502835, 3.35343652, ..., 0. , 0. ,
0. ]])>, <tf.Tensor: shape=(64, 1), dtype=int32, numpy=
array([[3],
[1],
[2],
[1],
[3],
[1],
[1],
[1],
[1],
[2],
[2],
[2],
[3],
[2],
[2],
[2],
[2],
[2],
[1],
[2],
[1],
[2],
[3],
[2],
[3],
[1],
[1],
[1],
[3],
[1],
[1],
[2],
[2],
[2],
[2],
[2],
[2],
[3],
[3],
[1],
[1],
[3],
[1],
[1],
[1],
[2],
[1],
[1],
[3],
[2],
[1],
[3],
[1],
[3],
[3],
[1],
[2],
[1],
[1],
[1],
[2],
[1],
[1],
[1]], dtype=int32)>))
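Since each element coming out of this pipeline is a 2-D (64, 31635) TF-IDF batch while a Keras LSTM layer expects 3-D input of shape (batch, timesteps, features), here is a minimal sketch of one way the model could consume it; the singleton time dimension added by Reshape and the layer sizes are my assumptions, not the actual architecture:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Reshape((1, 31635), input_shape=(31635,)),  # (batch, 1, 31635): one timestep per sentence
    tf.keras.layers.LSTM(64),                                    # assumed number of units
    tf.keras.layers.Dense(4, activation="softmax"),              # vocab_tar_size = 4 classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",  # targets are integer class ids
              metrics=["accuracy"])

# the batched dataset above already yields (input, target) pairs, so it can be passed to fit directly:
# model.fit(dataset, epochs=1, steps_per_epoch=steps_per_epoch)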
Question:

I am not using a pretrained embedding layer; instead, each sentence is represented by its TF-IDF vector. I am not removing stop words from the input, so TF-IDF down-weights any word that is overly frequent in the corpus.

Suppose I instead used only the token sequences created by the Keras tokenizer (rather than a TF-IDF vector per sentence, as explained above). Would that be a sound choice in theory? What do you think?

Note: 31635 is the size of the corpus vocabulary (the number of distinct words across all sentences). So each sentence vector has length 31635, but it is mostly sparse (zeros), since the longest sentence in my input is only about 300 words.
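For comparison, here is a minimal sketch of the alternative raised in the question: feeding the integer token sequences produced by the Keras tokenizer into an Embedding layer trained from scratch. The sequence length of 305 comes from the longest sentence mentioned above; the embedding dimension and layer sizes are placeholder assumptions:

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

texts = ["first example sentence", "another short sentence"]  # placeholder corpus

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)

# integer token ids instead of TF-IDF values; pad every sentence to the longest length (305 words)
sequences = tokenizer.texts_to_sequences(texts)
padded = pad_sequences(sequences, maxlen=305, padding="post")  # shape (num_sentences, 305)

vocab_inp_size = len(tokenizer.word_index) + 1  # 31635 for the real corpus

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_inp_size, 128, input_length=305),  # embedding learned from scratch; 128 is assumed
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(4, activation="softmax"),  # vocab_tar_size = 4 classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

With this representation each word keeps its position, so the LSTM sees a sequence of 305 timesteps per sentence, whereas a single TF-IDF vector collapses the whole sentence into one bag-of-words step.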