hpc - How can I use the Globus CLI to make Snakemake recognize Globus remote files?

Question

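I work in a high-performance computing grid environment where large-scale data transfers are done through Globus. I would like to use Snakemake to pull data from a Globus path, process it, and then push the processed data to a different Globus path. Globus has a command-line interface.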

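Pulling the data is no problem, because I only need a rule that runs globus transfer to create the necessary local files. To push data back to Globus, however, I think I need a rule that can "see" that a file is missing at the remote location and then work backward to determine what has to happen to create it.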
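For illustration, the pull step could be wrapped in a small Python helper that a rule's run block calls. This is only a sketch under a few assumptions: the endpoint IDs and paths are placeholders, the helper names are made up for this example, and it relies on globus transfer submitting an asynchronous task, so the task ID is captured and handed to globus task wait before the rule is considered done.

```python
import subprocess

def build_transfer_cmd(src_endpoint, src_path, dst_endpoint, dst_path):
    """Build the argument list for one `globus transfer` submission."""
    return [
        "globus", "transfer",
        f"{src_endpoint}:{src_path}",
        f"{dst_endpoint}:{dst_path}",
        # Print only the task ID so it can be passed to `globus task wait`.
        "--jmespath", "task_id", "--format", "unix",
    ]

def transfer_and_wait(src_endpoint, src_path, dst_endpoint, dst_path,
                      runner=subprocess.run):
    """Submit a transfer, then block until the task finishes.

    `runner` is injectable (hypothetical convenience) so the helper can be
    exercised without a Globus login.
    """
    submit = runner(
        build_transfer_cmd(src_endpoint, src_path, dst_endpoint, dst_path),
        check=True, capture_output=True, text=True,
    )
    task_id = submit.stdout.strip()
    runner(["globus", "task", "wait", task_id], check=True)
    return task_id
```

The same helper would work for the push direction by swapping the source and destination arguments. It is worth checking globus transfer --help on your installation to confirm the output options behave as assumed here.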

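I could create local "proxy" files that represent the remote files. For example, I could write rules whose outputs are files such as "processed_data_1234.tar.gz" in some directory. These files would simply be created with touch (so they are empty), and the same rule would run globus transfer to push the real files to the remote location. The drawback is the overhead of making sure the proxy files never get out of sync with the actual Globus-hosted files.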
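To make the synchronization overhead concrete, here is a minimal sketch of the check it implies: compare the proxy files on disk against a listing of the remote directory and report proxies whose remote counterpart is gone. The function name, the proxy directory, and the glob pattern are assumptions for this example; the remote names would have to come from something like globus ls.

```python
from pathlib import Path

def stale_proxies(proxy_dir, remote_names, pattern="*.tar.gz"):
    """Return names of local proxy files that have no matching remote file."""
    # Names of the empty placeholder files created with touch.
    local = {p.name for p in Path(proxy_dir).glob(pattern)}
    # A proxy is stale when the remote listing no longer contains its name.
    return sorted(local - set(remote_names))
```

Deleting the stale proxies before invoking Snakemake would make it treat those outputs as missing and rerun the rule that pushes them.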

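Is there a more elegant way to do this, something similar to the remote files feature? Would it be hard to add Globus CLI support to Snakemake? Thanks in advance for any advice!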

score 0 · Accepted Answer

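Would it help to create a utility function that generates a list of all the needed files and compares it with the list of files available on Globus? Something like this (pseudocode):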

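def return_needed_files():
    # Files the workflow must produce: hard-coded or derived with some logic.
    list_needed_files = []
    # Files already on the Globus endpoint, e.g. parsed from globus ls output.
    list_available = []
    # Keep only the files that are not on the endpoint yet.
    return [i for i in list_needed_files if i not in list_available]

# Include all the needed files in the all rule.
# The function is called here: a bare function name would be treated as an
# input function and invoked with a wildcards argument.
rule all:
    input: return_needed_files()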
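
One possible way to fill in list_available is to shell out to globus ls and parse its output. The sketch below assumes that the default text output prints one entry name per line; the endpoint ID, the remote path, and the helper names are placeholders invented for this example.

```python
import subprocess

def list_remote(endpoint, path, runner=subprocess.run):
    """Return the entry names printed by `globus ls ENDPOINT:PATH`."""
    result = runner(["globus", "ls", f"{endpoint}:{path}"],
                    check=True, capture_output=True, text=True)
    # Assumes one name per line; blank lines are skipped.
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]

def missing_remote(needed, endpoint, path, runner=subprocess.run):
    """Return the needed file names that are not yet on the endpoint."""
    available = set(list_remote(endpoint, path, runner))
    return [name for name in needed if name not in available]
```

The list returned by missing_remote could then be mapped to whatever local targets the push rule produces. Note that the listing happens when the Snakefile is parsed, so it reflects the remote state at that moment only.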

