I want to implement a speech recognizer with CTC loss in TensorFlow. The input features have variable length, because each speech utterance can have a variable length. The labels also have variable length, because each transcription is different. I manually pad the features to create batches, and in my model I have a tf.keras.layers.Masking() layer to create the mask and propagate it through the network. I also create the label batches with padding.
Here is a dummy example. Suppose I have two utterances of 3 and 5 frames respectively. Each frame is represented by a single feature (normally these would be 13 MFCCs, but I reduce it to one to keep things simple). So, to create the batch, I pad the short utterance with 0 at the end:
features = np.array([[1.5, 2.3, 4.6, 0.0, 0.0],
                     [1.7, 2.6, 3.4, 2.3, 1.0]])
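For completeness, this is roughly how such a padded batch could be built (just a sketch; I pad manually, and pad_sequences is only one way to do it):
import numpy as np
import tensorflow as tf

# Two utterances of 3 and 5 frames, one feature per frame.
utt_a = [1.5, 2.3, 4.6]
utt_b = [1.7, 2.6, 3.4, 2.3, 1.0]

# Pad the shorter utterance with 0.0 at the end so both have the same length.
features = tf.keras.preprocessing.sequence.pad_sequences(
    [utt_a, utt_b], padding='post', dtype='float32', value=0.0)

# The model below expects a trailing feature axis, so the batch becomes (2, 5, 1).
features = features[..., np.newaxis]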
The labels are the transcriptions of these utterances. Say their lengths are 2 and 3 respectively. The label batch shape would then be [2, 3, 26], where the batch size is 2, the maximum length is 3, and there are 26 English characters (one-hot encoded).
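A sketch of how I build such a label batch (the character indices below are made up, and I assume index 0 is reserved for padding):
import tensorflow as tf

# Transcriptions as character indices (hypothetical values), lengths 2 and 3.
label_a = [3, 15]
label_b = [8, 1, 20]

# Pad to the longest transcription in the batch with 0.
labels = tf.keras.preprocessing.sequence.pad_sequences(
    [label_a, label_b], padding='post', value=0)  # shape (2, 3)

# One-hot encode over the 26 English characters -> shape (2, 3, 26).
labels_onehot = tf.one_hot(labels, depth=26)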
The model is:
input_ = tf.keras.Input(shape=(None,1))
x = tf.keras.layers.Masking()(input_)
x = tf.keras.layers.GRU(26, return_sequences=True)(x)
output_ = tf.keras.layers.Softmax(axis=-1)(x)
model = tf.keras.Model(input_,output_)
The loss function is something like:
def ctc_loss(y_true, y_pred):
    # Do something here to get logit_length and label_length?
    # ...
    loss = tf.keras.backend.ctc_batch_cost(y_true, y_pred, logit_length, label_length)
    return loss
My question is how to get logit_length and label_length. I would assume that logit_length is encoded in the mask, but if I do y_pred._keras_mask the result is None. For label_length, the information is in the tensor itself, but I am not sure about the most efficient way of getting it.
Thanks.
Update:
Following Tou You's answer, I use tf.math.count_nonzero to get label_length, and I set logit_length to the length of the logit layer.
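Concretely, the loss function now looks roughly like this (a sketch of what I am doing; the fill/cast details are just one way to pass a per-sample length, assuming 0 is only used as label padding):
def ctc_loss(y_true, y_pred):
    # label_length: number of non-zero (i.e. non-padding) label entries per utterance.
    label_length = tf.math.count_nonzero(y_true, axis=-1, keepdims=True)
    # logit_length: the (padded) time dimension of the logits, the same for every sample.
    logit_length = tf.fill([tf.shape(y_pred)[0], 1], tf.shape(y_pred)[1])
    return tf.keras.backend.ctc_batch_cost(
        y_true, y_pred,
        tf.cast(logit_length, tf.int32),
        tf.cast(label_length, tf.int32))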
So the shapes inside the loss function are (batch size = 10):
y_true.shape = (10, None)
y_pred.shape = (10, None, 27)
label_length.shape = (10,1)
logit_length.shape = (10,1)
Of course, the 'None' of y_true and y_pred are not the same, since one is the maximum string length of the batch and the other is the maximum number of time frames of the batch. However, when I call model.fit() with these arguments and the loss tf.keras.backend.ctc_batch_cost(), I get the error:
Traceback (most recent call last):
File "train.py", line 164, in <module>
model.fit(dataset, batch_size=batch_size, epochs=10)
File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 66, in _method_wrapper
return method(self, *args, **kwargs)
File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 848, in fit
tmp_logs = train_function(iterator)
File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 580, in __call__
result = self._call(*args, **kwds)
File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 644, in _call
return self._stateless_fn(*args, **kwds)
File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2420, in __call__
return graph_function._filtered_call(args, kwargs) # pylint: disable=protected-access
File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1661, in _filtered_call
return self._call_flat(
File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1745, in _call_flat
return self._build_call_outputs(self._inference_function.call(
File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 593, in call
outputs = execute.execute(
File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: Incompatible shapes: [10,92] vs. [10,876]
[[node Equal (defined at train.py:164) ]]
(1) Invalid argument: Incompatible shapes: [10,92] vs. [10,876]
[[node Equal (defined at train.py:164) ]]
[[ctc_loss/Log/_62]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_3156]
Function call stack:
train_function -> train_function
It looks like it is complaining that the length of y_true (92) is not the same as the length of y_pred (876), which I thought it should not have to be. What am I missing?