python-3.x - Multi-output classification with TensorFlow Extended (TFX)

Question

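I am new to TFX (TensorFlow Extended) and have been working through the example tutorials on the TensorFlow site to learn enough to apply it to my own dataset.

In my scenario, the problem is not predicting a single label; I need to predict 2 outputs (category 1, category 2).

I have already done this with the plain TensorFlow Keras functional API and it works well, but now I want to see whether I can fit it into a TFX pipeline.

The error occurs at the Trainer stage of the pipeline, inside _input_fn. I suspect this is because I am not correctly splitting the given data into the (features, labels) tensor pair.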

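Scenario:

Each row of input data has the form [Col1, Col2, Col3, ClassificationA, ClassificationB]
ClassificationA and ClassificationB are the categorical labels I am trying to predict with the Keras functional model

The output layer of the Keras functional model looks like this, with 2 outputs connected to a single dense layer (note: the _xf suffix just indicates that I have encoded the classes as integers):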

output_1 = tf.keras.layers.Dense(TargetA_Class, activation='sigmoid', name = 'ClassificationA_xf')(dense)

output_2 = tf.keras.layers.Dense(TargetB_Class, activation='sigmoid', name = 'ClassificationB_xf')(dense)

model = tf.keras.Model(inputs = inputs, outputs = [output_1, output_2])

In the trainer module file, I import the required packages at the top:

import tensorflow_transform as tft
from tfx.components.tuner.component import TunerFnResult
import tensorflow as tf
from typing import List, Text
from tfx.components.trainer.executor import TrainerFnArgs
from tfx.components.trainer.fn_args_utils import DataAccessor, FnArgs
from tfx_bsl.tfxio import dataset_options

The current _input_fn in the trainer module file looks like this (following the tutorial):

def _input_fn(file_pattern: List[Text],
              data_accessor: DataAccessor,
              tf_transform_output: tft.TFTransformOutput,
              batch_size: int = 200) -> tf.data.Dataset:
  """Helper function that Generates features and label dataset for tuning/training.

  Args:
    file_pattern: List of paths or patterns of input tfrecord files.
    data_accessor: DataAccessor for converting input to RecordBatch.
    tf_transform_output: A TFTransformOutput.
    batch_size: representing the number of consecutive elements of returned
      dataset to combine in a single batch

  Returns:
    A dataset that contains (features, indices) tuple where features is a
      dictionary of Tensors, and indices is a single Tensor of label indices.
      
  """
  return data_accessor.tf_dataset_factory(
      file_pattern,
      dataset_options.TensorFlowDatasetOptions(
          batch_size=batch_size, 
          #label_key=[_transformed_name(x) for x in _CATEGORICAL_LABEL_KEYS]),
          label_key=_transformed_name(_CATEGORICAL_LABEL_KEYS[0]), _transformed_name(_CATEGORICAL_LABEL_KEYS[1])),
      tf_transform_output.transformed_metadata.schema)

When I run the Trainer component, the error is:

label_key=_transformed_name(_CATEGORICAL_LABEL_KEYS[0]), _transformed_name(_CATEGORICAL_LABEL_KEYS[1])),

^ SyntaxError: positional argument follows keyword argument
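
The SyntaxError above is raised by the Python parser before any TFX code runs: after `label_key=...`, the second name is read as a positional argument. A minimal sketch of this, using a hypothetical stand-in function `make_options` (an assumption for illustration, not the TFX API):

```python
def make_options(batch_size, label_key=None):
    """Hypothetical stand-in for an options constructor; not the TFX API."""
    return {"batch_size": batch_size, "label_key": label_key}


# Source text of a call that puts a positional argument after a keyword one.
bad_call = (
    "make_options(batch_size=200, "
    "label_key='ClassificationA_xf', 'ClassificationB_xf')"
)

try:
    # compile() only parses the text, so the error comes from the syntax alone.
    compile(bad_call, "<example>", "eval")
    error_message = None
except SyntaxError as err:
    error_message = err.msg

print(error_message)

# Grouping both names into one list makes the call syntactically valid.
options = make_options(
    batch_size=200,
    label_key=["ClassificationA_xf", "ClassificationB_xf"],
)
print(options["label_key"])
```

A list removes the syntax error only; whether the real callee accepts a list for that argument is a separate question.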

I also tried label_key=[_transformed_name(x) for x in _CATEGORICAL_LABEL_KEYS], which also gives an error.

However, if I pass in only one label key, label_key=_transformed_name(_CATEGORICAL_LABEL_KEYS[0]), it works fine.

FYI: _CATEGORICAL_LABEL_KEYS is just a list containing the names of the 2 outputs I am trying to predict (ClassificationA, ClassificationB).

_transformed_name is just a function that returns the updated name/key for the transformed data:

def _transformed_name(key):
  return key + '_xf'

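Question:

As far as I can tell, the label_key argument of dataset_options.TensorFlowDatasetOptions accepts only a single string (the name of one label), which means it may not be able to produce a dataset with multiple labels.

Is there a way to modify _input_fn so that the dataset it returns yields 2 output labels? The returned tensors would look like: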

Feature_Tensor: {Col1_xf: Col1_transformedfeature_values, Col2_xf: Col2_transformedfeature_values, Col3_xf: Col3_transformedfeature_values}

Label_Tensor: {ClassificationA_xf: ClassA_encodedlabels, ClassificationB_xf: ClassB_encodedlabels}

Hoping for advice from the wider TFX community!

score 0 · Accepted Answer

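Since label_key is optional, instead of specifying it in TensorFlowDatasetOptions you could call dataset.map afterwards and pull both labels out of the feature dictionary there.

Not tested, but something like: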

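def _data_augmentation(feature_dict):
  # Keep only the transformed input features the model consumes.
  features = {_transformed_name(x): feature_dict[_transformed_name(x)]
              for x in _CATEGORICAL_FEATURE_KEYS}
  # Key each label tensor by its transformed name, which matches the
  # names of the two Keras output layers.
  labels = {_transformed_name(x): feature_dict[_transformed_name(x)]
            for x in _CATEGORICAL_LABEL_KEYS}

  return features, labels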
  

def _input_fn(file_pattern: List[Text],
              data_accessor: DataAccessor,
              tf_transform_output: tft.TFTransformOutput,
              batch_size: int = 200) -> tf.data.Dataset:
  """Helper function that Generates features and label dataset for tuning/training.

  Args:
    file_pattern: List of paths or patterns of input tfrecord files.
    data_accessor: DataAccessor for converting input to RecordBatch.
    tf_transform_output: A TFTransformOutput.
    batch_size: representing the number of consecutive elements of returned
      dataset to combine in a single batch

  Returns:
    A dataset of (features, labels) tuples, where features is a dictionary
      of Tensors and labels is a dictionary holding one Tensor per label key.
      
  """
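  # No label_key here: every column stays in one dictionary so that the
  # map step below can split features from labels.
  dataset = data_accessor.tf_dataset_factory(
      file_pattern,
      dataset_options.TensorFlowDatasetOptions(batch_size=batch_size),
      tf_transform_output.transformed_metadata.schema)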

  dataset = dataset.map(_data_augmentation)
  return dataset
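
To make the intent of the mapping step concrete, here is a minimal sketch that runs on plain Python dictionaries, with lists standing in for the batched tensors a real dataset would yield. The names `LABEL_KEYS` and `split_features_and_labels` are hypothetical, and this variant treats every non-label column as a feature, which is an assumption rather than part of the answer.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical label keys, mirroring the transformed names used in the question.
LABEL_KEYS: List[str] = ["ClassificationA_xf", "ClassificationB_xf"]


def split_features_and_labels(
    batch: Dict[str, Any],
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split one batch dictionary into (features, labels) dictionaries."""
    features = dict(batch)  # shallow copy so the caller's dict is not mutated
    # Move each label column out of the features and into its own dict.
    labels = {key: features.pop(key) for key in LABEL_KEYS}
    return features, labels


# Plain lists stand in for batched tensors; the values are made up.
batch = {
    "Col1_xf": [0.1, 0.2],
    "Col2_xf": [1.0, 2.0],
    "Col3_xf": [5.0, 6.0],
    "ClassificationA_xf": [0, 2],
    "ClassificationB_xf": [1, 0],
}

features, labels = split_features_and_labels(batch)
print(sorted(features))
print(sorted(labels))
```

Because the function only uses dictionary operations, a function of the same shape could be passed to dataset.map, which hands each dictionary element to the mapped function as a single argument. Keying the labels by the output layer names (ClassificationA_xf, ClassificationB_xf) is what lets Keras pair each label with its output.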
