
Some background:

I structure the data as TFIDF vectors with shape (15637, 31635), and this is the input to the LSTM layer. The longest sentence in my corpus is 305 words, and each TFIDF vector has length 31635 because the total vocabulary of the corpus contains that many words.

Each of the 15637 sentences is therefore a TFIDF vector of shape (31635,).

I am using TFIDF instead of a pretrained embedding layer.
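
For context, here is a minimal sketch (not the actual code from the post) of how such a (num_sentences, vocab_size) TFIDF matrix can be produced with the Keras Tokenizer; the `sentences` list is a toy stand-in for the 15637 real sentences:

```python
import tensorflow as tf

# Toy stand-in for the real corpus of 15637 sentences.
sentences = ["the cat sat on the mat", "the dog barked", "a cat and a dog"]

tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(sentences)

# mode='tfidf' gives one dense row per sentence and one column per vocabulary
# entry (Keras reserves index 0, so the width is len(word_index) + 1).
input_tfidfVector = tokenizer.texts_to_matrix(sentences, mode='tfidf')
print(input_tfidfVector.shape)  # (3, 10) for this toy corpus; the post reports (15637, 31635)
```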

No_of_sentences = 15637

BATCH_SIZE = 64

steps_per_epoch = 15637/64 = 244 (with remainder dropped)

vocab_inp_size = 31635  # tokens created by the Keras tokenizer; these are the distinct words in the input corpus

vocab_tar_size = 4  # number of target classes (the target value is one-hot encoded)
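
As a quick, standalone sanity check on the batching arithmetic above (not from the original post):

```python
No_of_sentences = 15637
BATCH_SIZE = 64

steps_per_epoch = No_of_sentences // BATCH_SIZE
print(steps_per_epoch)                                  # 244
# With drop_remainder=True the last partial batch is discarded:
print(No_of_sentences - steps_per_epoch * BATCH_SIZE)   # 21 sentences dropped per epoch
```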

The code below first creates tensor slices, then batches the tensor slices, and finally enumerates over the batches; each iteration yields a tuple of the form (batch, (input_tensor, target_tensor)):

dataset = tf.data.Dataset.from_tensor_slices((input_tfidfVector, target_tensor_train)).shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)  # this is where the batching happens

for batch in enumerate(dataset.take(steps_per_epoch)):
    print(batch)  # prints the tuple: the current batch index (batch 0) together with the input and target tensors

(0, (<tf.Tensor: shape=(64, 31635), dtype=float64, numpy=
array([[0.        , 1.74502835, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 1.74502835, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 1.74502835, 3.35343652, ..., 0.        , 0.        ,
        0.        ]])>, <tf.Tensor: shape=(64, 1), dtype=int32, numpy=
array([[3],
       [1],
       [2],
       [1],
       [3],
       [1],
       [1],
       [1],
       [1],
       [2],
       [2],
       [2],
       [3],
       [2],
       [2],
       [2],
       [2],
       [2],
       [1],
       [2],
       [1],
       [2],
       [3],
       [2],
       [3],
       [1],
       [1],
       [1],
       [3],
       [1],
       [1],
       [2],
       [2],
       [2],
       [2],
       [2],
       [2],
       [3],
       [3],
       [1],
       [1],
       [3],
       [1],
       [1],
       [1],
       [2],
       [1],
       [1],
       [3],
       [2],
       [1],
       [3],
       [1],
       [3],
       [3],
       [1],
       [2],
       [1],
       [1],
       [1],
       [2],
       [1],
       [1],
       [1]], dtype=int32)>))

Question:

I am not using a pretrained embedding layer; instead I use one TFIDF vector per sentence. I do not remove stop words from the input, so TFIDF down-weights any word that occurs too frequently in the corpus.
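
To illustrate the down-weighting (using scikit-learn's TfidfVectorizer here rather than the Keras pipeline from the post, purely as an example): a word that appears in every sentence gets the lowest IDF, so its TFIDF weight is suppressed even without explicit stop-word removal.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the model learns quickly",
    "the data is noisy",
    "the results look promising",
]

vec = TfidfVectorizer()          # no stop_words argument: nothing is removed
vec.fit(docs)

idf = dict(zip(vec.get_feature_names_out(), vec.idf_))
print(idf["the"])    # ~1.0  -> occurs in every document, heavily down-weighted
print(idf["noisy"])  # ~1.69 -> occurs in only one document, kept relatively heavy
```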

Suppose I just used the tokens created by the Keras tokenizer (instead of the TFIDF vectors for the sentences, as explained above). Would that be a sound choice in theory? What do you think?

Note: 31635 is the vocabulary size of the corpus (the number of distinct words across all sentences). So each sentence vector has length 31635, but it is mostly sparse (zero-padded), since the longest sentence I feed in is only about 300 words.
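
For comparison, here is a sketch of the token-sequence alternative asked about above: integer sequences from the Keras tokenizer, padded to the longest sentence, fed through a trainable Embedding layer into the LSTM. The sentence data, `embedding_dim`, and the layer sizes are illustrative assumptions, not values from the post:

```python
import tensorflow as tf

# Toy stand-in data; the real corpus has 15637 sentences and 4 target classes.
sentences = ["the cat sat on the mat", "the dog barked", "a cat and a dog"]
labels = [0, 1, 2]

tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(sentences)

max_len = 305                                   # longest sentence in the real corpus
vocab_inp_size = len(tokenizer.word_index) + 1  # +1 for the reserved index 0
embedding_dim = 128                             # free choice, not from the question

seqs = tokenizer.texts_to_sequences(sentences)
padded = tf.keras.preprocessing.sequence.pad_sequences(seqs, maxlen=max_len)

model = tf.keras.Sequential([
    # mask_zero=True lets the LSTM ignore the padded timesteps
    tf.keras.layers.Embedding(vocab_inp_size, embedding_dim, mask_zero=True),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(4, activation="softmax"),  # vocab_tar_size = 4 classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(padded, tf.constant(labels), epochs=1, verbose=0)
```

Unlike the per-sentence TFIDF vector, this representation keeps word order, which is what an LSTM is designed to exploit; the TFIDF vector collapses each sentence into a single bag-of-words step.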
