python - TFX component CsvExampleGen always produces examples with empty outputs (and inputs)

Question

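I can run CsvExampleGen without any error message, but the outputs (and inputs) of the generated examples are always empty.

I am using tfx==0.24.0.

According to the documentation and tutorials (including https://www.tensorflow.org/tfx/guide/examplegen) and the release notes for tfx 0.23.0/0.24.0 (https://github.com/tensorflow/tfx/releases), the following lines should be enough to read CSV files with CsvExampleGen: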

from tfx.components import CsvExampleGen
example_gen = CsvExampleGen(input_base=data_path)

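Here "data_path" identifies a directory containing the CSV files. (Note that this code differs from the official documentation in that it does not use "external_input"; instead it follows the new interface documented in the 0.23.0 release notes.)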

From the tutorials I gathered that a single simple CSV file should be enough for testing (although I tried with up to 7 files).

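I do not get any error messages (apart from the ones I was told to ignore when no GPU is available); however, the outputs (and inputs) of the resulting structure are empty (an empty list and an empty set/dict, respectively). I believe they should not be empty.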

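The CSV files in question are found and read, because if I introduce an error there (for example, an additional column in one row), I do get an error message.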

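I tried this both with the component called on its own and inside a pipeline (run with BeamDagRunner for simplicity). The pipeline does produce a metadata.db, but I cannot find any trace of the CSV data (such as column names) in it. Adding StatisticsGen to the pipeline did not help any further.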
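
A note on the metadata.db check above: as far as I understand ML Metadata, the store records metadata about runs and artifacts (for example, where outputs were written), not the CSV contents themselves, so column names would not be expected there. A minimal sketch for checking whether anything was recorded at all, using only the standard library. The function name `table_row_counts` is hypothetical, and no particular table names are assumed; it reports whatever tables the file contains.

```python
import os
import sqlite3
from typing import Dict


def table_row_counts(db_path: str) -> Dict[str, int]:
    """Return {table_name: row_count} for every user table in a SQLite file."""
    # sqlite3.connect() would silently create a missing file, so check first.
    if not os.path.exists(db_path):
        raise FileNotFoundError(db_path)
    conn = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
                "ORDER BY name"
            )
        ]
        # Table names come from sqlite_master itself, so quoting them suffices.
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        conn.close()
```

Calling `table_row_counts(METADATA_PATH)` after a run shows which tables received rows. Tables that are all empty would suggest the run recorded nothing, whereas populated tables suggest the component did execute and publish its results.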

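I tried this with the iris dataset, with and without column headers. I also tried up to 7 small artificial CSV files in data_path, with purely numerical or mixed numerical/categorical data, and with commas or semicolons as delimiters. The result is always the same.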
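
To rule out the input files themselves, a quick standard-library check can be run over the directory before handing it to CsvExampleGen. This is only a sketch under assumptions: as I read the ExampleGen documentation, the files are expected to be comma-separated, to have a header row, and to share the same header across the directory. The function name `check_csv_dir` and the wording of the reported problems are hypothetical.

```python
import csv
import glob
import os
from typing import Dict, List


def check_csv_dir(data_path: str, delimiter: str = ",") -> Dict[str, List[str]]:
    """Return {file_name: [problems]} for every *.csv file directly in data_path."""
    report: Dict[str, List[str]] = {}
    first_header = None
    for path in sorted(glob.glob(os.path.join(data_path, "*.csv"))):
        problems: List[str] = []
        with open(path, newline="") as handle:
            # Skip completely blank lines so they are not counted as records.
            records = [row for row in csv.reader(handle, delimiter=delimiter) if row]
        if not records:
            problems.append("file is empty")
        else:
            header = records[0]
            if len(header) == 1:
                problems.append("only one column found; check the delimiter")
            if first_header is None:
                first_header = header
            elif header != first_header:
                problems.append("header differs from the first file")
            # The header is record 1, so data records are numbered from 2.
            for number, record in enumerate(records[1:], start=2):
                if len(record) != len(header):
                    problems.append(
                        f"record {number}: {len(record)} fields, expected {len(header)}"
                    )
        report[os.path.basename(path)] = problems
    return report
```

An empty problem list for every file means the directory passes these basic checks. It does not prove that CsvExampleGen will accept the files, only that the obvious structural issues (wrong delimiter, ragged rows, mismatched headers) are absent.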

Is there a problem with my code, or with some configuration or library?

Here is the complete code (as far as relevant):

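import os
from typing import List, Optional, Text

from ml_metadata.proto import metadata_store_pb2
from tfx.components import CsvExampleGen, StatisticsGen
from tfx.orchestration import metadata, pipeline
from tfx.orchestration.beam.beam_dag_runner import BeamDagRunner

PIPELINE_NAME = "X-pipeline-iris2"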
BASE_PATH = r"C:\***\FX_Experiments"
BASE_PATH_PIPELINE = os.path.join(BASE_PATH, "pipeline")
BASE_PATH_TESTS = os.path.join(BASE_PATH, "tests")
PIPELINE_ROOT = os.path.join(BASE_PATH_PIPELINE, "output")
METADATA_PATH = os.path.join(BASE_PATH_PIPELINE, "tfx_metadata", PIPELINE_NAME, "metadata.db")
DATA_PATH = os.path.join(BASE_PATH_TESTS, "iris2")
ENABLE_CACHE = True


def create_pipeline(
        pipeline_name: Text, pipeline_root: Text, data_path: Text,
        enable_cache: bool,
        metadata_connection_config: Optional[metadata_store_pb2.ConnectionConfig] = None,
        beam_pipeline_args: Optional[List[Text]] = None
):
    components = []

    example_gen = CsvExampleGen(input_base=data_path)
    components.append(example_gen)

    stat_gen = StatisticsGen(examples=example_gen.outputs['examples'])
    components.append(stat_gen)

    return pipeline.Pipeline(
        pipeline_name=pipeline_name,
        pipeline_root=pipeline_root,
        components=components,
        enable_cache=enable_cache,
        metadata_connection_config=metadata_connection_config,
        beam_pipeline_args=beam_pipeline_args
    )

def run_pipeline():
    this_pipeline = create_pipeline(
        pipeline_name=PIPELINE_NAME,
        pipeline_root=PIPELINE_ROOT,
        data_path=DATA_PATH,
        enable_cache=ENABLE_CACHE,
        metadata_connection_config=metadata.sqlite_metadata_connection_config(METADATA_PATH)
    )
    BeamDagRunner().run(this_pipeline)

Possibly also useful, the logger output:

INFO:absl:Excluding no splits because exclude_splits is not set.
INFO:absl:Component CsvExampleGen depends on [].
INFO:absl:Component CsvExampleGen is scheduled.
INFO:absl:Component StatisticsGen depends on ['Run[CsvExampleGen]'].
INFO:absl:Component StatisticsGen is scheduled.
INFO:absl:Component CsvExampleGen is running.
INFO:absl:Running driver for CsvExampleGen
INFO:absl:MetadataStore with DB connection initialized
INFO:absl:select span and version = (0, None)
INFO:absl:latest span and version = (0, None)
INFO:absl:Running publisher for CsvExampleGen
INFO:absl:MetadataStore with DB connection initialized
INFO:absl:Component CsvExampleGen is finished.
INFO:absl:Component StatisticsGen is running.
...

score 0 · Accepted Answer

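Felix, if you are following the guide, you are probably running your code in a notebook. If you want to see the results directly, you have to use InteractiveContext to enable TFX interactivity.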

https://www.tensorflow.org/tfx/api_docs/python/tfx/orchestration/experimental/interactive/interactive_context/InteractiveContext

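from tfx.components import CsvExampleGen
from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext

context = InteractiveContext()
example_gen = CsvExampleGen(input_base='/content/data')
context.run(example_gen)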
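
Outside a notebook, for example when running the pipeline from the question with BeamDagRunner, one way to see whether examples were produced is to look for output files on disk. The sketch below assumes only that the component writes its outputs somewhere below the pipeline root. The exact directory layout and file names depend on the TFX version and are not assumed here, and the function name `list_output_files` is hypothetical.

```python
import os
from typing import List, Tuple


def list_output_files(pipeline_root: str) -> List[Tuple[str, int]]:
    """Return (relative_path, size_in_bytes) for every file below pipeline_root."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(pipeline_root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.append((os.path.relpath(full, pipeline_root), os.path.getsize(full)))
    return sorted(found)
```

Running `list_output_files(PIPELINE_ROOT)` after the pipeline finishes and seeing non-empty files under a CsvExampleGen directory would indicate that the examples were written, even if the component object's attributes look empty when printed.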
