python - One distcp command to upload several files to s3 (no directory)

Question

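I'm currently using the s3a adapter of Hadoop/HDFS to upload a number of files from a Hive database to a particular s3 bucket. I'm getting nervous because I can't find anything online about specifying a bunch of file paths (rather than directories) for a copy via distcp.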

I have set up my program to collect an array of file paths using a function, inject them all into a distcp command, and then run the command:

files = self.get_files_for_upload()
if not files:
    logger.warning("No recently updated files found. Exiting...")
    return

full_path_files = [f"hdfs://nameservice1{file}" for file in files]
s3_dest = "path/to/bucket"
cmd = f"hadoop distcp -update {' '.join(full_path_files)} s3a://{s3_dest}"

logger.info(f"Preparing to upload Hive data files with cmd: \n{cmd}")
result = subprocess.run(cmd, shell=True, check=True)

This basically just creates one long distcp command with 15-20 different file paths. Will this work? Should I be using the -cp or -put commands instead of distcp?

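(It doesn't make sense to me to copy all of these files into their own directory and then distcp that entire directory, when I can just copy them directly and skip those steps...)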

score 1 · Accepted Answer

-cp and -put would require you to download the HDFS files and then upload them to S3. That would be a lot slower.

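I don't see an immediate reason why this wouldn't work; however, after reading over the documentation, I would recommend using the -f flag instead.

For example: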

files = self.get_files_for_upload()
if not files:
    logger.warning("No recently updated files found. Exiting...")
    return

src_file = 'to_copy.txt'
with open(src_file, 'w') as f:
    for file in files:
        f.write(f'hdfs://nameservice1{file}\n')

s3_dest = "path/to/bucket"
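# Pass the arguments as a list without shell=True; combining a list with
# shell=True on POSIX would run only 'hadoop' and drop the other arguments.
result = subprocess.run(['hadoop', 'distcp', '-f', src_file, f's3a://{s3_dest}'], check=True)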

If all the files were already in their own directory, then you should just copy the directory, like you said.
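
To compare the two approaches side by side, here is a minimal sketch that only builds the distcp argument lists and does not run anything. The helper names build_multi_source_cmd and build_listing_cmd are hypothetical, and the namenode prefix is taken from the question. One assumption worth checking on your cluster: distcp likely resolves the -f path against the default filesystem, so the sketch passes the local listing file as a file:// URI.

```python
import os
import tempfile
from pathlib import Path


def build_multi_source_cmd(files, s3_dest, namenode="hdfs://nameservice1"):
    """Return a distcp argument list with every source path listed explicitly."""
    sources = [f"{namenode}{path}" for path in files]
    return ["hadoop", "distcp", "-update", *sources, f"s3a://{s3_dest}"]


def build_listing_cmd(files, s3_dest, namenode="hdfs://nameservice1"):
    """Write the sources to a local listing file and return (argv, listing_path)."""
    fd, listing_path = tempfile.mkstemp(prefix="distcp_sources_", suffix=".txt")
    with os.fdopen(fd, "w") as listing:
        for path in files:
            listing.write(f"{namenode}{path}\n")
    # Assumption: a file:// URI makes distcp read the list from local disk
    # rather than looking for it on the default (HDFS) filesystem.
    listing_uri = Path(listing_path).as_uri()
    argv = ["hadoop", "distcp", "-f", listing_uri, f"s3a://{s3_dest}"]
    return argv, listing_path


# Example run (requires a Hadoop client on the PATH):
#   import subprocess
#   argv, listing_path = build_listing_cmd(files, "path/to/bucket")
#   subprocess.run(argv, check=True)   # a list, so no shell=True
#   os.remove(listing_path)
```

Building the command as a list avoids shell quoting problems if a path ever contains spaces, and the listing file keeps the command line short when the number of files grows.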
