还在https://github.com/tensorflow/transform/issues/261上发布了问题
我在 TFX 中使用 tft,需要将字符串列表类标签转换为内部的多热指标preprocesing_fn
。本质上:
vocab = tft.vocabulary(inputs['label'])
outputs['label'] = tf.cast(
tf.sparse.to_indicator(
tft.apply_vocabulary(inputs['label'], vocab),
vocab_size=VOCAB_SIZE,
),
"int64",
)
我试图从 vocab 的结果中获取 VOCAB_SIZE,但找不到满足延迟执行和已知形状的方法。我在下面得到的最接近的不会通过保存的模型导出,因为标签的形状是未知的。
def _make_table_initializer(filename_tensor):
return tf.lookup.TextFileInitializer(
filename=filename_tensor,
key_dtype=tf.string,
key_index=tf.lookup.TextFileIndex.WHOLE_LINE,
value_dtype=tf.int64,
value_index=tf.lookup.TextFileIndex.LINE_NUMBER,
)
def _vocab_size(deferred_vocab_filename_tensor):
initializer = _make_table_initializer(deferred_vocab_filename_tensor)
table = tf.lookup.StaticHashTable(initializer, default_value=-1)
table_size = table.size()
return table_size
deferred_vocab_and_filename = tft.vocabulary(inputs['label'])
vocab_applied = tft.apply_vocabulary(inputs['label'], deferred_vocab_and_filename)
vocab_size = _vocab_size(deferred_vocab_and_filename)
outputs['label'] = tf.cast(
tf.sparse.to_indicator(vocab_applied, vocab_size=vocab_size),
"int64",
)
得到
ValueError: Feature label (Tensor("Identity_3:0", shape=(None, None), dtype=int64)) had invalid shape (None, None) for FixedLenFeature: apart from the batch dimension, all dimensions must have known size [while running 'Analyze/CreateSavedModel[tf_v2_only]/CreateSavedModel']
知道如何实现这一目标吗?