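There is a tutorial for building a Transformer chatbot. It takes a list of word-encoded sentences of differing lengths, first pads away the length differences with tf.keras.preprocessing, and then creates a dataset from those padded sentences.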
I am trying to create the dataset first and then pad and batch it with dataset.padded_batch(), because using a single API seems more cohesive.
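The problem I ran into is that, without padding beforehand, I am left with a list of lists of different lengths (for example, the list of encoded questions below), which, as far as I understand, cannot be used directly to create a dataset.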
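To make the problem concrete, here is a minimal plain-Python sketch of the padding that has to happen somewhere before the data becomes rectangular. The helper pad_post is hypothetical and written only for illustration (it is not part of the tutorial or of TensorFlow), and padding with 0 at the end of each sequence is an assumption about what the tutorial's preprocessing step does.

```python
def pad_post(sequences, max_length, pad_value=0):
    """Right-pad every sequence with pad_value up to max_length (hypothetical helper)."""
    padded = []
    for seq in sequences:
        if len(seq) > max_length:
            raise ValueError("sequence longer than max_length")
        padded.append(list(seq) + [pad_value] * (max_length - len(seq)))
    return padded


# Two short made-up encoded sentences of different lengths (illustrative values).
encoded = [[5475, 32, 16, 5476], [5475, 77, 5476]]
lengths_before = [len(s) for s in encoded]

# After padding, every row has the same length and can form a 2-D tensor.
rectangular = pad_post(encoded, max_length=6)
lengths_after = [len(s) for s in rectangular]
```

Until a step like this has run, the rows have different lengths, which is why the list cannot be turned into a regular dense tensor.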
Sample question: i really, really, really wanna go, but i can t. not unless my sister goes.
Sample answer: i m workin on it. but she doesn t seem to be goin for him.
Encoded sample question: [3, 224, 1, 224, 1, 154, 295, 180, 1, 42, 3, 32, 5335, 4, 31, 589, 27, 416, 1387, 5265]
List of encoded questions: [[5475, 32, 16, 106, 38, 2392, 25, 3796, 4313, 11, 5143, 5073, 34, 565, 108, 1099, 4422, 1278, 1929, 76, 45, 5, 3911, 4, 272, 5265, 5476], [5475, 77, 1, 3, 168, 16, 69, 246, 37, 2412, 1, 49, 13, 8, 1315, 37, 35, 5265, 5476], ...]
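Apparently, the way to create a dataset from such objects without padding them is to use tf.RaggedTensor(s). With that I can now create the dataset using tf.data.Dataset.from_tensor_slices, but I cannot pad the dataset afterwards.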
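The error I get is shown below. It is related to the specified shape of my dataset's contents (each element is one-dimensional, since the data is just a list of lists of different lengths):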
dataset = dataset.padded_batch(BATCH_SIZE, padded_shapes=({'inputs': (None, MAX_LENGTH),
                                                           'dec_inputs': (None, MAX_LENGTH)},
                                                          {'outputs': (None, MAX_LENGTH)}))
Traceback (most recent call last):
File "C:\Program Files\JetBrains\PyCharm Community Edition 2019.2.4\helpers\pydev\pydevd.py", line 1415, in _exec
pydev_imports.execfile(file, globals, locals) # execute the script
File "C:\Program Files\JetBrains\PyCharm Community Edition 2019.2.4\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "C:/Users/---/Desktop/IA/PyProjects/NLP_models/Natural Language Inference/transformer_chatbot.py", line 208, in <module>
{'outputs': (None, MAX_LENGTH)}))
File "C:\Users\---\Desktop\IA\PyProjects\NLP_models\venv\lib\site-packages\tensorflow_core\python\data\ops\dataset_ops.py", line 1097, in padded_batch
drop_remainder)
File "C:\Users\---\Desktop\IA\PyProjects\NLP_models\venv\lib\site-packages\tensorflow_core\python\data\ops\dataset_ops.py", line 3341, in __init__
_padded_shape_to_tensor(padded_shape, input_component_shape))
File "C:\Users\----\Desktop\IA\PyProjects\NLP_models\venv\lib\site-packages\tensorflow_core\python\data\ops\dataset_ops.py", line 3269, in _padded_shape_to_tensor
% (padded_shape_as_shape, input_component_shape))
ValueError: The padded shape (None, 40) is not compatible with the corresponding input component shape (None,).
How can I give dataset.padded_batch() some reference shape so that it pads the second dimension up to MAX_LENGTH? Or is there any other way to pad that avoids the tf.keras.preprocessing step?
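One possible direction, as a hedged sketch rather than a confirmed fix: the error message compares the padded shape (None, 40) with a component shape of (None,), and padded_shapes describes a single element before batching, so the batch dimension should not appear in it. The sketch below builds the dataset from a Python generator instead of a ragged tensor and pads each one-dimensional element to (MAX_LENGTH,). The token values, the small MAX_LENGTH, and the generator generate_pairs are illustrative assumptions, and the way the answer is split into dec_inputs and outputs is only a guess at the tutorial's layout. The output_types and output_shapes arguments are used because the traceback points to an older TensorFlow 2.x; newer releases prefer the output_signature argument.

```python
import tensorflow as tf

MAX_LENGTH = 8   # illustrative; the script in the question uses 40
BATCH_SIZE = 2

# Made-up encoded sentences of different lengths (assumption, not real data).
questions = [[5, 11, 12, 6], [5, 21, 6], [5, 31, 32, 33, 6]]
answers = [[5, 41, 6], [5, 51, 52, 53, 6], [5, 61, 62, 6]]


def generate_pairs():
    """Yield one unpadded (features, labels) pair at a time."""
    for q, a in zip(questions, answers):
        yield {'inputs': q, 'dec_inputs': a[:-1]}, {'outputs': a[1:]}


dataset = tf.data.Dataset.from_generator(
    generate_pairs,
    output_types=({'inputs': tf.int32, 'dec_inputs': tf.int32},
                  {'outputs': tf.int32}),
    # Each element is a 1-D sequence of unknown length.
    output_shapes=({'inputs': (None,), 'dec_inputs': (None,)},
                   {'outputs': (None,)}),
)

# padded_shapes is per element, so it has rank 1 here (no batch dimension).
dataset = dataset.padded_batch(
    BATCH_SIZE,
    padded_shapes=({'inputs': (MAX_LENGTH,), 'dec_inputs': (MAX_LENGTH,)},
                   {'outputs': (MAX_LENGTH,)}),
)

first_features, first_labels = next(iter(dataset))
inputs_shape = tuple(first_features['inputs'].shape)
first_inputs_row = first_features['inputs'].numpy().tolist()[0]
```

Passing (None,) instead of (MAX_LENGTH,) in padded_shapes would pad each batch only to the length of its longest sequence, which may or may not suit a model that expects a fixed length.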