Some background:

I have structured my data as a TF-IDF matrix of shape (15637, 31635), and this matrix is the input to the LSTM layer. The longest sentence in my corpus is 305 words. Each TF-IDF vector has length 31635 because that is the size of the total vocabulary of the corpus; in other words, each of the 15637 sentences is represented by a TF-IDF vector of shape (31635,). I am using TF-IDF rather than a pretrained embedding layer.
No_of_sentences = 15637
BATCH_SIZE = 64
steps_per_epoch = 15637 // 64 = 244 (remainder dropped)
vocab_inp_size = 31635  # tokens created by the Keras tokenizer; these are the distinct words in the input corpus
vocab_tar_size = 4  # one-hot encoding of the target value
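For reference, here is a minimal sketch of how such a TF-IDF matrix and integer targets could be produced with the Keras Tokenizer; the texts and labels variables below are placeholders, not my actual preprocessing code:

import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer

texts = ["first example sentence", "another short sentence", "one more"]  # placeholder corpus
labels = [1, 2, 3]  # placeholder integer class ids

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)  # word_index now holds the distinct words of the corpus

# mode='tfidf' gives one TF-IDF vector per sentence, of length (vocab size + 1),
# analogous to the (15637, 31635) matrix described above
input_tfidfVector = tokenizer.texts_to_matrix(texts, mode="tfidf")

# targets as an int32 column vector, matching the (batch, 1) target shape printed below
target_tensor_train = np.array(labels, dtype="int32").reshape(-1, 1)

print(input_tfidfVector.shape, target_tensor_train.shape)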
The code below first creates tensor slices, then batches them; finally, enumerating the batched dataset yields, for each batch, a tuple of the form (batch, (input_tensor, target_tensor)).
import tensorflow as tf

BUFFER_SIZE = No_of_sentences  # assumed shuffle buffer size; not shown in the original snippet

dataset = tf.data.Dataset.from_tensor_slices((input_tfidfVector, target_tensor_train)).shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)  # this is where batching happens

for batch in enumerate(dataset.take(steps_per_epoch)):
    print(batch)  # prints the tuple: the current batch index (batch 0) plus the input and target tensors
(0, (<tf.Tensor: shape=(64, 31635), dtype=float64, numpy=
array([[0. , 1.74502835, 0. , ..., 0. , 0. ,
0. ],
[0. , 0. , 0. , ..., 0. , 0. ,
0. ],
[0. , 0. , 0. , ..., 0. , 0. ,
0. ],
...,
[0. , 0. , 0. , ..., 0. , 0. ,
0. ],
[0. , 1.74502835, 0. , ..., 0. , 0. ,
0. ],
[0. , 1.74502835, 3.35343652, ..., 0. , 0. ,
0. ]])>, <tf.Tensor: shape=(64, 1), dtype=int32, numpy=
array([[3],
[1],
[2],
[1],
[3],
[1],
[1],
[1],
[1],
[2],
[2],
[2],
[3],
[2],
[2],
[2],
[2],
[2],
[1],
[2],
[1],
[2],
[3],
[2],
[3],
[1],
[1],
[1],
[3],
[1],
[1],
[2],
[2],
[2],
[2],
[2],
[2],
[3],
[3],
[1],
[1],
[3],
[1],
[1],
[1],
[2],
[1],
[1],
[3],
[2],
[1],
[3],
[1],
[3],
[3],
[1],
[2],
[1],
[1],
[1],
[2],
[1],
[1],
[1]], dtype=int32)>))
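Since each element coming out of this pipeline is a 2-D (64, 31635) TF-IDF batch while a Keras LSTM layer expects 3-D input of shape (batch, timesteps, features), here is a minimal sketch of one way the model could consume it; the singleton time dimension added by Reshape and the layer sizes are my assumptions, not the actual architecture:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Reshape((1, 31635), input_shape=(31635,)),  # (batch, 1, 31635): one timestep per sentence
    tf.keras.layers.LSTM(64),                                    # assumed number of units
    tf.keras.layers.Dense(4, activation="softmax"),              # vocab_tar_size = 4 classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",  # targets are integer class ids
              metrics=["accuracy"])

# the batched dataset above already yields (input, target) pairs, so it can be passed to fit directly:
# model.fit(dataset, epochs=1, steps_per_epoch=steps_per_epoch)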
Question:

I am not using a pretrained embedding layer; instead, each sentence is represented by its TF-IDF vector. I am not removing stop words from the input, so TF-IDF down-weights any word that is overly frequent in the corpus.

Suppose I instead used only the token sequences created by the Keras tokenizer (rather than a TF-IDF vector per sentence, as explained above). Would that be a sound choice in theory? What do you think?

Note: 31635 is the size of the corpus vocabulary (the number of distinct words across all sentences). So each sentence vector has length 31635, but it is mostly sparse (zeros), since the longest sentence in my input is only about 300 words.
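For comparison, here is a minimal sketch of the alternative raised in the question: feeding the integer token sequences produced by the Keras tokenizer into an Embedding layer trained from scratch. The sequence length of 305 comes from the longest sentence mentioned above; the embedding dimension and layer sizes are placeholder assumptions:

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

texts = ["first example sentence", "another short sentence"]  # placeholder corpus

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)

# integer token ids instead of TF-IDF values; pad every sentence to the longest length (305 words)
sequences = tokenizer.texts_to_sequences(texts)
padded = pad_sequences(sequences, maxlen=305, padding="post")  # shape (num_sentences, 305)

vocab_inp_size = len(tokenizer.word_index) + 1  # 31635 for the real corpus

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_inp_size, 128, input_length=305),  # embedding learned from scratch; 128 is assumed
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(4, activation="softmax"),  # vocab_tar_size = 4 classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

With this representation each word keeps its position, so the LSTM sees a sequence of 305 timesteps per sentence, whereas a single TF-IDF vector collapses the whole sentence into one bag-of-words step.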