
I have implemented a custom TensorFlow dataset for my raw data. I can download, prepare, and load the data as a tensorflow.data.Dataset as follows:

import tensorflow_datasets

builder = tensorflow_datasets.builder("my_dataset")
builder.download_and_prepare()
ds = builder.as_dataset()

I want to transform this data in a TensorFlow Transform pipeline for model training. However, the only way I have found to pass the dataset into the Transform pipeline is to convert it to instance dicts and pass in the raw data metadata:

import tensorflow_transform.beam

# Materialize the dataset as in-memory instance dicts, then analyze and transform them.
# RAW_DATA_METADATA and preprocessing_fn are defined elsewhere (sketched below).
instance_dicts = tensorflow_datasets.as_dataframe(ds).to_dict(orient="records")
with tensorflow_transform.beam.Context():
    (transformed_data, _), transform_fn = (
        instance_dicts,
        RAW_DATA_METADATA,
    ) | tensorflow_transform.beam.AnalyzeAndTransformDataset(
        preprocessing_fn, output_record_batches=True
    )
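
For context, RAW_DATA_METADATA and preprocessing_fn look roughly like the following (a minimal sketch with placeholder feature names; the real definitions are elsewhere in my code):

import tensorflow
import tensorflow_transform
from tensorflow_transform.tf_metadata import dataset_metadata, schema_utils

# Hypothetical feature spec describing the raw examples.
RAW_DATA_METADATA = dataset_metadata.DatasetMetadata(
    schema_utils.schema_from_feature_spec({
        "token": tensorflow.io.FixedLenFeature([], tensorflow.string),
        "label": tensorflow.io.FixedLenFeature([], tensorflow.string),
    })
)

def preprocessing_fn(inputs):
    # Example transformation: map the string label to an integer id via a vocabulary.
    return {
        "token": inputs["token"],
        "label_id": tensorflow_transform.compute_and_apply_vocabulary(inputs["label"]),
    }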

Is there a simpler, more memory-efficient way to pass a TensorFlow dataset into a TensorFlow Transform pipeline?


1 Answer


In this case, a simpler and more memory-efficient way to pass a TensorFlow dataset into a TensorFlow Transform pipeline is to reference the TFRecord files written by the TensorFlow Datasets builder's download_and_prepare() job.

import apache_beam
from apache_beam.io import tfrecordio

import tensorflow_datasets
import tensorflow_transform.beam

# The TFDS builder writes its prepared TFRecord shards under info.data_dir.
examples_dir = tensorflow_datasets.builder("my_dataset").info.data_dir
examples_file_pattern = f"{examples_dir}/my_dataset-*"

with apache_beam.Pipeline() as pipeline:
    with tensorflow_transform.beam.Context():
        # Read the serialized tf.Example records directly from disk.
        raw = pipeline | tfrecordio.ReadFromTFRecord(file_pattern=examples_file_pattern)

To transform the raw data, create a TFXIO from the feature spec.

import tensorflow
from tensorflow_transform.tf_metadata.schema_utils import schema_from_feature_spec
from tfx_bsl.public import tfxio

# Feature spec describing the serialized tf.Example records.
example_spec = {
    "token": tensorflow.io.FixedLenFeature([], tensorflow.string),
    "label": tensorflow.io.FixedLenFeature([], tensorflow.string),
}
schema = schema_from_feature_spec(example_spec)
# TFXIO that decodes serialized tf.Example records into Arrow RecordBatches.
tfexample_tfxio = tfxio.TFExampleBeamRecord(physical_format="tfrecord", schema=schema)

Then, in the pipeline, pass the raw PCollection through the TFXIO's BeamSource and supply its TensorAdapterConfig so the RecordBatches can be converted to tensors.

        # ... continuing inside the Pipeline and Context blocks above
        (transformed_data, _), transform_fn = (
            (raw | tfexample_tfxio.BeamSource()),
            tfexample_tfxio.TensorAdapterConfig(),
        ) | tensorflow_transform.beam.AnalyzeAndTransformDataset(
            preprocessing_fn, output_record_batches=True
        )
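
To reuse the transform at training or serving time, the resulting transform_fn can also be written out inside the same pipeline; a minimal sketch, assuming a placeholder output_dir:

        # Persist the transform graph and metadata for later training/serving.
        # output_dir is a placeholder path, not defined in the snippets above.
        transform_fn | tensorflow_transform.beam.WriteTransformFn(output_dir)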
answered 2021-12-01 17:16