kedro - How to read/write/sync data on the cloud with Kedro

Question

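In short: how can I save a file both locally and on the cloud, and likewise how do I set things up to read from the local copy?

Longer description: there are two scenarios, 1) building a model and 2) serving the model through an API. While building the model, a series of analyses is run to generate features and the model. The results are written locally. At the end, everything is uploaded to S3. For serving, all the required files generated in the first step are downloaded first.

I am curious how I can leverage Kedro here. Perhaps I could define two entries for each file in conf/base/catalog.yml, one for the local version and a second for S3. But that may not be the most efficient approach when I am dealing with 20 files.

Alternatively, I could upload the files to S3 with my own script and leave the synchronization out of Kedro entirely. In other words, Kedro would be blind to the fact that copies exist on the cloud. Perhaps this is not the most Kedro-friendly approach.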

score 2 · Accepted Answer

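Not exactly the same question, but my answer here may be useful.

I would suggest that the simplest approach in your case is indeed to define two catalog entries and have Kedro save to both of them (and load from the local one for speed). This gives you the most flexibility, though I admit it is not the prettiest.

To avoid having every node function return two values, I would suggest applying a decorator to the nodes you mark with a specific tag, e.g. tags=["s3_replica"], taking inspiration from the following script (stolen from a colleague of mine):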

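from typing import Any, Callable, Dict, List

from kedro.framework.hooks import hook_impl
from kedro.io import DataCatalog
from kedro.pipeline.node import Node


class S3DataReplicationHook: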
    """
    Hook to replicate the output of any node tagged with `s3_replica` to S3.

    E.g. if a node is defined as:
        node(
            func=myfunction,
            inputs=['ds1', 'ds2'],
            outputs=['ds3', 'ds4'],
            tags=['tag1', 's3_replica']
        )

    Then the hook will expect to see `ds3.s3` and `ds4.s3` in the catalog.
    """

    @hook_impl
    def before_node_run(
        self,
        node: Node,
        catalog: DataCatalog,
        inputs: Dict[str, Any],
        is_async: bool,
        run_id: str,
    ) -> None:
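        if "s3_replica" in node.tags:
            # Wrap the function so it returns each output twice, then register
            # the matching `<name>.s3` outputs. Note: this mutates the node in
            # place and relies on `Node` allowing these attributes to be
            # reassigned, which may not hold in every Kedro version.
            node.func = _duplicate_outputs(node.func)
            node.outputs = _add_local_s3_outputs(node.outputs)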


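def _duplicate_outputs(func: Callable) -> Callable:
    def wrapped(*args, **kwargs):
        outputs = func(*args, **kwargs)
        # A multi-output node returns a tuple or list; repeat its elements so
        # they line up with [ds3, ds4, ds3.s3, ds4.s3]. A single output is
        # wrapped in a tuple first (this assumes a single output is never
        # itself a tuple or list).
        if not isinstance(outputs, (tuple, list)):
            outputs = (outputs,)
        return tuple(outputs) * 2

    return wrapped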


def _add_local_s3_outputs(outputs: List[str]) -> List[str]:
    return outputs + [f'{o}.s3' for o in outputs]

The above is a hook, so you can put it in your project's hooks.py file (or anywhere you like), then import it into your settings.py file and add:

from .hooks import ProjectHooks, S3DataReplicationHook

HOOKS = (ProjectHooks(), S3DataReplicationHook())

Both lines go in your settings.py.
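
To see what the two helpers used by the hook are meant to produce, here is a minimal standalone sketch that does not import Kedro. The names `duplicate_outputs`, `add_s3_outputs` and `split`, and the explicit `n_outputs` argument, are assumptions made for illustration and are not part of the script above.

```python
from typing import Any, Callable, List


def duplicate_outputs(func: Callable[..., Any], n_outputs: int) -> Callable[..., tuple]:
    """Wrap `func` so every returned value is emitted twice (local copy, then S3 copy)."""

    def wrapped(*args: Any, **kwargs: Any) -> tuple:
        result = func(*args, **kwargs)
        # A function with a single output returns a bare value; normalise to a tuple.
        values = (result,) if n_outputs == 1 else tuple(result)
        return values + values

    return wrapped


def add_s3_outputs(outputs: List[str]) -> List[str]:
    """Append a `<name>.s3` entry for every original output name."""
    return outputs + [f"{name}.s3" for name in outputs]


def split(data: List[int]) -> tuple:
    """Toy function with two outputs, standing in for a node function."""
    return data[:2], data[2:]


names = add_s3_outputs(["ds3", "ds4"])
values = duplicate_outputs(split, n_outputs=2)([1, 2, 3, 4])
# Pair each output name with the value that would be saved under it.
paired = dict(zip(names, values))
```

Each value ends up listed once under its original name and once under the `.s3` name, which is why the catalog is expected to contain both entries.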

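You can be a bit cleverer with the output naming convention so that only certain outputs are replicated. For example, you might agree that every catalog entry ending in .local must also have a corresponding .s3 entry, and then mutate the outputs of your node accordingly, rather than doing it for every output.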
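
The naming convention could be sketched roughly as follows. This is plain Python with no Kedro imports; `plan_replication` and `replicate_selected` are hypothetical helper names, and the `.local` / `.s3` suffixes are the convention suggested above rather than anything Kedro enforces.

```python
from typing import Any, Callable, List, Tuple


def plan_replication(outputs: List[str]) -> Tuple[List[str], List[int]]:
    """Return the extended output names and the positions that get an S3 copy."""
    positions = [i for i, name in enumerate(outputs) if name.endswith(".local")]
    s3_names = [outputs[i][: -len(".local")] + ".s3" for i in positions]
    return outputs + s3_names, positions


def replicate_selected(
    func: Callable[..., Any], positions: List[int], n_outputs: int
) -> Callable[..., tuple]:
    """Wrap `func` so only the values at `positions` are returned a second time."""

    def wrapped(*args: Any, **kwargs: Any) -> tuple:
        result = func(*args, **kwargs)
        # Normalise a single bare return value to a one-element tuple.
        values = (result,) if n_outputs == 1 else tuple(result)
        return values + tuple(values[i] for i in positions)

    return wrapped


new_outputs, positions = plan_replication(["features.local", "report"])
wrapped = replicate_selected(lambda: ("F", "R"), positions, n_outputs=2)
result = wrapped()
```

Here only `features.local` gains a `features.s3` twin, while `report` is saved once.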

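If you want to be cleverer still, you could use the after_catalog_created hook to inject the corresponding S3 entries into the catalog, again following a naming convention of your choice, instead of writing the S3 version of each dataset in the catalog by hand. That said, I would argue that writing the S3 entries out explicitly is more readable in the long run.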
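
As a rough sketch of the injection idea, the function below derives the S3 entries from the catalog configuration as a plain dictionary. It only covers the derivation step: turning the result into dataset objects and adding them to the catalog inside an after_catalog_created hook is not shown, since that call depends on the Kedro version. The function name, the bucket `my-bucket`, the paths and the dataset type string are illustrative assumptions.

```python
from copy import deepcopy
from typing import Any, Dict


def derive_s3_entries(
    conf_catalog: Dict[str, Dict[str, Any]], local_root: str, s3_root: str
) -> Dict[str, Dict[str, Any]]:
    """Build a `<name>.s3` entry for every `<name>.local` entry under `local_root`."""
    s3_entries: Dict[str, Dict[str, Any]] = {}
    for name, entry in conf_catalog.items():
        filepath = entry.get("filepath", "")
        # Skip entries that do not follow the convention or live elsewhere.
        if not name.endswith(".local") or not filepath.startswith(local_root):
            continue
        s3_entry = deepcopy(entry)
        # Swap the local root for the S3 prefix, keeping the relative path.
        s3_entry["filepath"] = s3_root + filepath[len(local_root):]
        s3_entries[name[: -len(".local")] + ".s3"] = s3_entry
    return s3_entries


conf = {
    "model.local": {"type": "pickle.PickleDataSet", "filepath": "data/06_models/model.pkl"},
    "raw": {"type": "pandas.CSVDataSet", "filepath": "data/01_raw/raw.csv"},
}
s3_conf = derive_s3_entries(conf, "data/", "s3://my-bucket/data/")
```

With 20 files this keeps the hand-written catalog to the local entries only, at the cost of the S3 entries no longer being visible in catalog.yml.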

score 1 · Accepted Answer

I can think of two ways. The simpler one is to keep a separate conf environment for cloud and for local, and select it with --env. https://kedro.readthedocs.io/en/latest/04_kedro_project_setup/02_configuration.html#additional-configuration-environments

conf
├── base
│   └── 
├── cloud
│   └── catalog.yml
└── my_local
    └── catalog.yml

You can then call kedro run --env=cloud or kedro run --env=my_local, depending on which environment you want to use.
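
Conceptually, the selected environment's catalog entries take precedence over same-named entries in base. The sketch below imitates that with plain dictionaries; it is a simplified picture and not Kedro's actual loading code, and the dataset names, paths and bucket are made up for illustration.

```python
from typing import Any, Dict

Catalog = Dict[str, Dict[str, Any]]


def resolve_catalog(base: Catalog, env: Catalog) -> Catalog:
    """Entries from the chosen environment replace same-named base entries."""
    return {**base, **env}


base: Catalog = {"report": {"filepath": "data/08_reporting/report.csv"}}
cloud: Catalog = {"model": {"filepath": "s3://my-bucket/06_models/model.pkl"}}
my_local: Catalog = {"model": {"filepath": "data/06_models/model.pkl"}}

# The pipeline always refers to "model"; only the environment decides where it lives.
cloud_catalog = resolve_catalog(base, cloud)
local_catalog = resolve_catalog(base, my_local)
```

The pipeline code stays identical in both cases because the dataset names do not change, only their definitions.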

Another, more advanced approach is to use TemplatedConfigLoader https://kedro.readthedocs.io/en/stable/kedro.config.TemplatedConfigLoader.html

conf
├── base
│   └── catalog.yml
├── cloud
│   └── globals.yml (contains `base_path: s3-prefix-path`)
└── my_local
    └── globals.yml (contains `base_path: my_local_path`)

In catalog.yml, you can refer to base_path like this:

my_dataset:
    filepath: ${base_path}/my_dataset

You can then call kedro run --env=cloud or kedro run --env=my_local, depending on which environment you want to use.
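
The effect of the template can be pictured with a small stand-in for the substitution step. This is not the TemplatedConfigLoader implementation, only a sketch of the idea; it assumes that `base_path` holds the complete prefix (including `s3://` for the cloud environment), and the bucket and folder names are invented.

```python
import re
from typing import Any, Dict

_PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def substitute(value: Any, global_values: Dict[str, Any]) -> Any:
    """Recursively replace `${name}` placeholders with values from `global_values`."""
    if isinstance(value, dict):
        return {key: substitute(item, global_values) for key, item in value.items()}
    if isinstance(value, str):
        # A placeholder missing from `global_values` raises KeyError.
        return _PLACEHOLDER.sub(lambda m: str(global_values[m.group(1)]), value)
    return value


catalog = {"my_dataset": {"filepath": "${base_path}/my_dataset"}}
cloud_globals = {"base_path": "s3://my-bucket/data"}
local_globals = {"base_path": "data"}

cloud_catalog = substitute(catalog, cloud_globals)
local_catalog = substitute(catalog, local_globals)
```

A single catalog.yml then serves both environments, and only the one-line globals.yml differs between them.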
